Abstract Cascade classifiers are widely used in real-time
object detection. Different from conventional classifiers
that are designed for a low overall classification
error rate, a classifier in each node of the cascade is 
required to achieve an extremely high detection rate 
and moderate false positive rate. Although there are 
a few reported methods addressing this requirement 
in the context of object detection, there is no princi- 
pled feature selection method that explicitly takes into 
account this asymmetric node learning objective. We 
provide such an algorithm here. We show that a spe- 
cial case of the biased minimax probability machine has 
the same formulation as the linear asymmetric classifier 
(LAC) of Wu et al. (2005). We then design a new boost- 
ing algorithm that directly optimizes the cost function 
of LAC. The resulting totally-corrective boosting algo- 
rithm is implemented by the column generation tech- 
nique in convex optimization. Experimental results on 
object detection verify the effectiveness of the proposed 
boosting algorithm as a node classifier in cascade ob- 
ject detection, and show performance better than that 
of the current state-of-the-art. 
1 Introduction

Real-time object detection inherently involves search- 
ing a large number of candidate image regions for a 
small number of objects. Processing a single image, for 
example, can require the interrogation of well over a 
million scanned windows in order to uncover a single 
correct detection. This imbalance in the data has an 
impact on the way that detectors are applied, but also 
on the training process. This impact is reflected in the 
need to identify discriminative features from within a 
large over-complete feature set. 

Cascade classifiers have been proposed as a poten- 
tial solution to the problem of imbalance in the data 
(Viola and Jones 2004; Bi et al. 2006; Dundar and Bi 
2007; Brubaker et al. 2008; Wu et al. 2008), and have 
received significant attention due to their speed and ac- 
curacy. In this work, we propose a principled method 
by which to train a boosting-based cascade of classifiers.

The boosting-based cascade approach to object de- 
tection was introduced by Viola and Jones (Viola and 
Jones 2004; 2002), and has received significant subse- 
quent attention (Li and Zhang 2004; Pham and Cham 
2007b; Pham et al. 2008; Paisitkriangkrai et al. 2008; 
Shen et al. 2008; Paisitkriangkrai et al. 2009). It also 
underpins the current state-of-the-art (Wu et al. 2005; 
2008). 

The Viola and Jones approach uses a cascade of 
increasingly complex classifiers, each of which aims to 
achieve the best possible classification accuracy while 
achieving an extremely low false negative rate. These 
classifiers can be seen as forming the nodes of a de- 
generate binary tree (see Fig. 1) whereby a negative 
result from any single such node classifier terminates 
the interrogation of the current patch. Viola and Jones 
use AdaBoost to train each node classifier in order to 
achieve the best possible classification accuracy. A low
false negative rate is achieved by subsequently adjust- 
ing the decision threshold until the desired false nega- 
tive rate is achieved. This process cannot be guaranteed 
to produce the best detection performance for a given 
false negative rate. 

Under the assumption that each node of the cas- 
cade classifier makes independent classification errors, 
the detection rate and false positive rate of the entire 
cascade are $F_{dr} = \prod_{t=1}^{N} d_t$ and
$F_{fp} = \prod_{t=1}^{N} f_t$, respectively, where $d_t$
represents the detection rate of classifier $t$, $f_t$ the
corresponding false positive rate and $N$ the number of nodes.
As pointed out in (Viola and Jones 2004; Wu et al. 2005),
these two equations suggest a node learning objective: each
node should have an extremely high detection rate $d_t$
(e.g., 99.7%) and a moderate false positive rate $f_t$
(e.g., 50%). With the above values of $d_t$ and $f_t$, and a
cascade of $N = 20$ nodes, $F_{dr} \approx 94\%$ and
$F_{fp} \approx 10^{-6}$, which is a typical design goal.
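The arithmetic behind this design goal can be checked with a few lines of Python. This is only an illustrative sketch: the function name `cascade_rates` is ours, and it relies on the same independence assumption as the text.

```python
import math

def cascade_rates(detection_rates, false_positive_rates):
    """Overall (F_dr, F_fp) of a cascade, assuming independent node errors."""
    return math.prod(detection_rates), math.prod(false_positive_rates)

# Per-node rates quoted in the text: d_t = 99.7%, f_t = 50%, N = 20 nodes.
n_nodes = 20
F_dr, F_fp = cascade_rates([0.997] * n_nodes, [0.5] * n_nodes)
print(f"F_dr = {F_dr:.4f}, F_fp = {F_fp:.2e}")
```

Because both rates are products, a small drop in the per-node detection rate compounds quickly over 20 nodes, which is why each node must keep $d_t$ so close to 1.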

One drawback of the standard AdaBoost approach 
to boosting is that it does not take advantage of the cas- 
cade classifier's special structure. AdaBoost only mini- 
mizes the overall classification error and does not par- 
ticularly minimize the number of false negatives. In this 
sense, the features selected by AdaBoost are not opti- 
mal for the purpose of rejecting as many negative ex- 
amples as possible. Viola and Jones proposed a solu- 
tion to this problem in AsymBoost (Viola and Jones 
2002) (and its variants (Pham and Cham 2007b; Pham 
et al. 2008; Wang et al. 2012; Masnadi-Shirazi and Vas- 
concelos 2007)) by modifying the loss function so as 
to more greatly penalize false negatives. AsymBoost 
achieves better detection rates than AdaBoost, but still 
addresses the node learning goal indirectly, and cannot 
be guaranteed to achieve the optimal solution. 

Wu et al. explicitly studied the node learning goal 
and proposed to use linear asymmetric classifier (LAC) 
and Fisher linear discriminant analysis (LDA) to adjust 
the weights on a set of features selected by AdaBoost 
or AsymBoost (Wu et al. 2005; 2008). Their experi- 
ments indicated that with this post-processing tech- 
nique the node learning objective can be better met, 
which is translated into improved detection rates. In 
Viola and Jones' framework, boosting is used to select 
features and at the same time to train a strong classifier. 
Wu et al.'s work separates these two tasks: AdaBoost 
or AsymBoost is used to select features; and as a sec- 
ond step, LAC or LDA is used to construct a strong 
classifier by adjusting the weights of the selected fea- 
tures. The node learning objective is only considered at 
the second step. At the first step — feature selection — 
the node learning objective is not explicitly considered 
at all. We conjecture that further improvement may be 
gained if the node learning objective is explicitly taken 
into account at both steps. We thus propose new boost- 
ing algorithms to implement this idea and verify this 
conjecture. A preliminary version of this work was published
in Shen et al. (2010).

Our major contributions are as follows. 

1. Starting from the theory of minimax probability 
machines (MPMs), we derive a simplified version 
of the biased minimax probability machine, which 
has the same formulation as the linear asymmet- 
ric classifier of Wu et al. (2005). We thus show the 
underlying connection between MPM and LAC. Im- 
portantly, this new interpretation weakens some of 
the restrictions on the acceptable input data distri- 
bution imposed by LAC. 

2. We develop new boosting-like algorithms by directly
minimizing the objective function of the linear asym- 
metric classifier, which results in an algorithm that 
we label LACBoost. We also propose FisherBoost 
on the basis of Fisher LDA rather than LAC. Both 
methods may be used to identify the feature set 
that optimally achieves the node learning goal when 
training a cascade classifier. To our knowledge, this 
is the first attempt to design such a feature selection 
method. 

3. LACBoost and FisherBoost share similarities with 
LPBoost (Demiriz et al. 2002) in the sense that both 
use column generation — a technique originally pro- 
posed for large-scale linear programming (LP). Typ- 
ically, the Lagrange dual problem is solved at each 
iteration in column generation. We instead solve
the primal quadratic programming (QP) problem,
whose special structure allows entropic gradient
(EG) to solve it very efficiently. Compared with
general interior-point based QP solvers, EG is
much faster.

4. We apply LACBoost and FisherBoost to object
detection and observe better performance than
other methods (Wu et al. 2005; 2008; Maji et al. 2008).
In particular, on pedestrian detection FisherBoost
achieves state-of-the-art results compared with the
methods listed in (Dollar et al. 2012) on three
benchmark datasets. The results confirm our conjecture
and show the effectiveness of LACBoost and Fisher- 
Boost. These methods can be immediately applied 
to other asymmetric classification problems. 

Moreover, we analyze the condition under which LAC
is valid, and show that the multi-exit cascade might be
more suitable than Viola-Jones' conventional cascade for
applying the LAC learning of Wu et al. (2005) and
Wu et al. (2008) (and our LACBoost).
As observed in Wu et al. (2008), in many cases, LDA
even performs better than LAC. In our experiments, we 
have also observed similar phenomena. Paisitkriangkrai 
et al. (2009) empirically showed that LDA's criterion 
can be used to achieve better detection results. An ex- 
planation of why LDA works so well for object detection 
is missing in the literature. Here we demonstrate that 
in the context of object detection, LDA can be seen as 
a regularized version of LAC in approximation. 

The proposed LACBoost/FisherBoost algorithm dif- 
fers from traditional boosting algorithms in that it does 
not minimize a loss function. This opens new possibil- 
ities for designing boosting-like algorithms for special 
purposes. We have also extended column generation for 
optimizing nonlinear optimization problems. Next we 
review related work in the context of real-time object 
detection using cascade classifiers. 

1.1 Related Work 

The field of object detection has made significant
progress over the last decade, especially after the seminal
work of Viola and Jones. Three key components
that contribute to their first robust real-time object de- 
tection framework are: 

1. The cascade classifier, which efficiently filters out
negative patches in early nodes while maintaining a 
very high detection rate; 

2. AdaBoost, which selects informative features and at
the same time trains a strong classifier; 

3. The use of integral images, which makes the com- 
putation of Haar features extremely fast. 

This approach has received significant subsequent at- 
tention. A number of alternative cascades have been de- 
veloped including the soft cascade (Bourdev and Brandt 
2005), WaldBoost (Sochman and Matas 2005), the dy- 
namic cascade (Xiao et al. 2007), the AND-OR cascade 
(Dundar and Bi 2007), the multi-exit cascade (Pham 
et al. 2008), the joint cascade (Lefakis and Fleuret 2010) 
and the recently proposed rate constraint embedded
cascade (RCECBoost) (Saberian and Vasconcelos 2012).
In this work we have adopted the multi-exit cascade of 
Pham et al. due to its effectiveness and efficiency as 
demonstrated in Pham et al. (2008). The multi-exit cas- 
cade improves classification performance by using the 
results of all of the weak classifiers applied to a patch 
so far in reaching a decision at each node of the tree 
(see Fig. 1). Thus the n-th node classifier uses the re- 
sults of the weak classifiers associated with node n, but 
also those associated with the previous n—1 node clas- 
sifiers in the cascade. We show below that LAC post- 
processing can enhance the multi-exit cascade, and that 
the multi-exit cascade more accurately fulfills the LAC 
requirement that the margin be drawn from a Gaussian 
distribution. 
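The difference between the two cascade structures can be sketched as follows. This is a toy illustration under our own assumptions (the function names, the score layout and the thresholds are hypothetical), not the implementation of Pham et al. (2008).

```python
def standard_cascade(node_scores, thresholds):
    """Viola-Jones style: node t thresholds only its own weak-classifier sum."""
    for scores, b in zip(node_scores, thresholds):
        if sum(scores) < b:
            return False  # rejected: stop interrogating this patch
    return True

def multi_exit_cascade(node_scores, thresholds):
    """Multi-exit style: node t thresholds the running sum over nodes 1..t."""
    running = 0.0
    for scores, b in zip(node_scores, thresholds):
        running += sum(scores)  # reuse every weak classifier applied so far
        if running < b:
            return False
    return True

# Hypothetical weighted weak-classifier outputs w_j * h_j(x) for one patch,
# grouped by the node that introduced them.
patch = [[0.6, 0.5], [-0.2], [0.1]]
thresholds = [0.5, 0.0, 0.0]
accepted_standard = standard_cascade(patch, thresholds)
accepted_multi_exit = multi_exit_cascade(patch, thresholds)
print(accepted_standard, accepted_multi_exit)
```

In this toy example the second node of the standard cascade sees only its own weak response and rejects the patch, whereas the multi-exit cascade also counts the strong evidence accumulated at the first node.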

In addition to improving the cascade structure, a 
number of improvements have been made on the learn- 
ing algorithm for building node classifiers in a cascade. 
Wu et al., for example, use fast forward feature selection 
to accelerate the training procedure (Wu et al. 2003). 
Wu et al. (2005) also showed that LAC may be used 
to deliver better classification performance. Pham and 
Cham recently proposed online asymmetric boosting 
that considerably reduces the training time required 
(Pham and Cham 2007b). By exploiting the feature 
statistics, Pham and Cham (2007a) have also designed 
a fast method to train weak classifiers. Li and Zhang 
(2004) proposed FloatBoost, which discards redundant 
weak classifiers during AdaBoost's greedy selection
procedure. Masnadi-Shirazi and Vasconcelos (2011)
proposed cost-sensitive boosting algorithms which can be
applied to different cost-sensitive losses by means of
gradient descent. Liu and Shum (2003) also proposed
KLBoost, aiming to select features that maximize the
projected Kullback-Leibler divergence and select feature
weights by minimizing the classification error.
Promising results have also been reported by LogitBoost
(Tuzel et al. 2008) that employs the logistic regression 
loss, and GentleBoost (Torralba et al. 2007) that uses 
adaptive Newton steps to fit the additive model. Multi- 
instance boosting has been introduced to object detec- 
tion (Viola et al. 2005; Dollar et al. 2008; Lin et al. 
2009), which does not require precisely labeled loca- 
tions of the targets in training data. 

New features have also been designed for improv- 
ing the detection performance. Viola and Jones' Haar 
features are not sufficiently discriminative for detecting 
more complex objects like pedestrians, or multi-view 
faces. Covariance features (Tuzel et al. 2008) and
histogram of oriented gradients (HOG) (Dalal and Triggs
2005) have been proposed in this context, and efficient
implementation approaches (along the lines of integral
images) have been developed for each. Shape context, which
can also exploit integral images (Aldavert et al. 2010),
was applied to human detection in thermal images
(Wang et al. 2010). The local binary pattern (LBP)
descriptor and its variants have shown promising
performance on human detection (Mu et al. 2008; Zheng
et al. 2010). Recently, effort has been spent on
combining complementary features, including: simple
concatenation of HOG and LBP (Wang et al. 2007),
combination of heterogeneous local features in a boosted
cascade classifier (Wu and Nevatia 2008), and Bayesian 
integration of intensity, depth and motion features in a 
mixture-of-experts model (Enzweiler et al. 2010). 






Chunhua Shen et al. 




Fig. 1: Cascade classifiers. The first one is the standard cascade of Viola and Jones (2004). The second one is the
multi-exit cascade proposed in Pham et al. (2008). Only those classified as true detection by all nodes will be true
targets.



The rest of the paper is organized as follows. We 
briefly review the concept of minimax probability ma- 
chine and derive the new simplified version of biased 
minimax probability machine in Section 2. Linear asym- 
metric classification and its connection to the minimax 
probability machine is discussed in Section 3. In Sec- 
tion 4, we show how to design new boosting algorithms 
(LACBoost and FisherBoost) by rewriting the opti- 
mization formulations of LAC and Fisher LDA. The 
new boosting algorithms are applied to object detec- 
tion in Section 5 and we conclude the paper in Section 
6. 



1.2 Notation 

The following notation is used. A matrix is denoted by a
bold upper-case letter (X); a column vector is denoted
by a bold lower-case letter (x). The $i$th row of $X$ is
denoted by $X_{i:}$ and the $i$th column by $X_{:i}$. The identity
matrix is $I$ and its size should be clear from the context.
$\mathbf{1}$ and $\mathbf{0}$ are column vectors of 1's and 0's, respectively.
We use $\succcurlyeq$ to denote component-wise inequalities.

Let $T = \{(x_i, y_i)\}_{i=1,\dots,m}$ be the set of training
data, where $x_i \in \mathcal{X}$ and $y_i \in \{-1, +1\}$, $\forall i$. The
training set consists of $m_1$ positive training points and $m_2$
negative ones; $m_1 + m_2 = m$. Let $h(\cdot) \in \mathcal{H}$ be a weak
classifier that projects an input vector $x$ into $\{-1, +1\}$.
Note that here we consider only classifiers with discrete
outputs although the developed methods can use real-valued
weak classifiers too. We assume that $\mathcal{H}$, the set
from which $h(\cdot)$ is selected, is finite and has $n$ elements.



Define the matrix $H^{Z} \in \mathbb{R}^{m \times n}$ such that the $(i, j)$
entry $H^{Z}_{ij} = h_j(x_i)$ is the label predicted by weak
classifier $h_j(\cdot)$ for the datum $x_i$, where $x_i$ is the $i$th
element of the set $Z$. In order to simplify the notation we
eliminate the superscript when $Z$ is the training set, so $H^{Z} = H$.
Therefore, each column $H_{:j}$ of the matrix $H$ consists
of the output of weak classifier $h_j(\cdot)$ on all the training
data, while each row $H_{i:}$ contains the outputs of all
weak classifiers on the training datum $x_i$. Define similarly
the matrix $A \in \mathbb{R}^{m \times n}$ such that $A_{ij} = y_i h_j(x_i)$.
Note that boosting algorithms depend entirely on the
matrix $A$ and do not directly interact with the training
examples. Our following discussion will thus largely
focus on the matrix $A$. We write the vector obtained
by multiplying a matrix $A$ with a vector $w$ as $Aw$ and
its $i$th entry as $(Aw)_i$. If we let $w$ represent the
coefficients of the selected weak classifiers then the margin
of the training datum $x_i$ is $\rho_i = A_{i:} w = (Aw)_i$ and
the vector of such margins for all of the training data
is $\rho = Aw$.
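The notation above can be made concrete with a small sketch. Everything here is illustrative: the decision-stump weak learners, the toy data and the weights are assumptions of ours rather than anything used in the experiments.

```python
def stump(feature, threshold):
    """Hypothetical weak classifier: +1 if x[feature] > threshold, else -1."""
    return lambda x: 1 if x[feature] > threshold else -1

X = [[0.2, 1.5], [0.9, 0.3], [0.4, 0.8]]   # m = 3 training data
y = [+1, -1, +1]                           # labels y_i
weak = [stump(0, 0.5), stump(1, 0.5)]      # n = 2 weak classifiers

# H[i][j] = h_j(x_i): outputs of all weak classifiers on all data.
H = [[h(x) for h in weak] for x in X]
# A[i][j] = y_i * h_j(x_i): +1 where h_j is correct on x_i, -1 otherwise.
A = [[yi * hij for hij in row] for yi, row in zip(y, H)]

w = [0.25, 0.75]                           # nonnegative weights summing to 1
# Margins rho = A w, one entry per training datum.
rho = [sum(a * wj for a, wj in zip(row, w)) for row in A]
print(H, A, rho)
```

Once `A` is built, the training data and labels are no longer needed to compute margins, which is the sense in which boosting depends only on $A$.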



2 Minimax Probability Machines 

Before we introduce our boosting algorithm, let us briefly 
review the concept of minimax probability machines 
(MPM) (Lanckriet et al. 2002) first. 



2.1 Minimax Probability Classifiers 

Let $x_1 \in \mathbb{R}^n$ and $x_2 \in \mathbb{R}^n$ denote two random vectors
drawn from two distributions with means and covariances
$(\mu_1, \Sigma_1)$ and $(\mu_2, \Sigma_2)$, respectively. Here $\mu_1, \mu_2 \in \mathbb{R}^n$
and $\Sigma_1, \Sigma_2 \in \mathbb{R}^{n \times n}$. We define the class labels of $x_1$
and $x_2$ as $+1$ and $-1$, w.l.o.g. The minimax probability
machine (MPM) seeks a robust separation hyperplane
that can separate the two classes of data with the maximal
probability. The hyperplane can be expressed as
$w^\top x = b$ with $w \in \mathbb{R}^n \setminus \{0\}$ and $b \in \mathbb{R}$. The problem of
identifying the optimal hyperplane may then be formulated as

$$
\max_{w, b, \gamma} \; \gamma \quad \text{s.t.} \quad
\Big[ \inf_{x_1 \sim (\mu_1, \Sigma_1)} \Pr\{ w^\top x_1 \ge b \} \Big] \ge \gamma, \quad
\Big[ \inf_{x_2 \sim (\mu_2, \Sigma_2)} \Pr\{ w^\top x_2 \le b \} \Big] \ge \gamma.
\tag{1}
$$

Here $\gamma$ is the lower bound of the classification accuracy
(or the worst-case accuracy) on test data. This problem
can be transformed into a convex problem, more specifically
a second-order cone program (SOCP) (Boyd and
Vandenberghe 2004) and thus can be solved efficiently
(Lanckriet et al. 2002).



2.2 Biased Minimax Probability Machines 

The formulation (1) assumes that the classification prob- 
lem is balanced. It attempts to achieve a high recog- 
nition accuracy, which assumes that the losses associ- 
ated with all mis-classifications are identical. However, 
in many applications this is not the case. 

Huang et al. (2004) proposed a biased version of
MPM through a slight modification of (1), which may
be formulated as

$$
\max_{w, b, \gamma} \; \gamma \quad \text{s.t.} \quad
\Big[ \inf_{x_1 \sim (\mu_1, \Sigma_1)} \Pr\{ w^\top x_1 \ge b \} \Big] \ge \gamma, \quad
\Big[ \inf_{x_2 \sim (\mu_2, \Sigma_2)} \Pr\{ w^\top x_2 \le b \} \Big] \ge \gamma_0.
\tag{2}
$$

Here $\gamma_0 \in (0, 1)$ is a prescribed constant, which is the
acceptable classification accuracy for the less important
class. The resulting decision hyperplane prioritizes the
classification of the important class $x_1$ over that of the
less important class $x_2$. Biased MPM is thus expected
to perform better in biased classification applications.

Huang et al. showed that (2) can be iteratively solved 
via solving a sequence of SOCPs using the fractional 
programming (FP) technique. Clearly it is significantly 
more computationally demanding to solve (2) than (1). 

Next we show how to re-formulate (2) into a simpler 
quadratic program (QP) based on the recent theoretical 
results in (Yu et al. 2009). 



2.3 Simplified Biased Minimax Probability Machines 

In this section, we are interested in simplifying problem
(2) for the special case of $\gamma_0 = 0.5$, due to its
important application in object detection (Viola and
Jones 2004; Wu et al. 2005). In the following discussion,
for simplicity, we only consider $\gamma_0 = 0.5$, although
some of the algorithms developed may also apply to $\gamma_0 < 0.5$.

Theoretical results in (Yu et al. 2009) show that the
worst-case constraint in (2) can be written in different
forms when $x$ follows arbitrary, symmetric, symmetric
unimodal or Gaussian distributions (see Appendix A).
Both the MPM (Lanckriet et al. 2002) and the biased
MPM (Huang et al. 2004) are based on the most general
form of the four cases shown in Appendix A, i.e.,
Equation (27) for arbitrary distributions, as they do not
impose constraints upon the distributions of $x_1$ and $x_2$.

However, one may take advantage of structural in- 
formation whenever available. For example, it is shown 
in (Wu et al. 2005) that, for the face detection problem, 
weak classifier outputs can be well approximated by the 
Gaussian distribution. In other words, the constraint for 
arbitrary distributions does not utilize any type of a 
priori information, and hence, for many problems, con- 
sidering arbitrary distributions for simplifying (1) and 
(2) is too conservative. Since both the MPM (Lanckriet 
et al. 2002) and the biased MPM (Huang et al. 2004) do 
not assume any constraints on the distribution family, 
they fail to exploit this structural information. 

Let us consider the special case of $\gamma_0 = 0.5$. It is easy
to see that the worst-case constraint in (2) becomes a
simple linear constraint for symmetric, symmetric
unimodal, as well as Gaussian distributions (see Appendix
A). As pointed out in (Yu et al. 2009), such a result
is the immediate consequence of symmetry because the
worst-case distributions are forced to put probability
mass arbitrarily far away on both sides of the mean.
In such a case, any information about the covariance is
neglected.

We now apply this result to the biased MPM as 
represented by (2). Our main result is the following 
theorem. 

Theorem 1 With $\gamma_0 = 0.5$, the biased minimax problem
(2) can be formulated as an unconstrained problem:

$$
\max_{w} \; \frac{w^\top (\mu_1 - \mu_2)}{\sqrt{w^\top \Sigma_1 w}},
\tag{3}
$$

under the assumption that $x_2$ follows a symmetric
distribution. The optimal $b$ can be obtained through:

$$
b = w^\top \mu_2.
\tag{4}
$$
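The worst-case classification accuracy for the first class,
$\gamma^\star$, is obtained by solving

$$
\varphi(\gamma^\star) = \frac{-b^\star + w^{\star\top} \mu_1}{\sqrt{w^{\star\top} \Sigma_1 w^\star}},
\tag{5}
$$

where

$$
\varphi(\gamma) =
\begin{cases}
\sqrt{\dfrac{\gamma}{1 - \gamma}} & \text{if } x_1 \sim (\mu_1, \Sigma_1), \\[2ex]
\sqrt{\dfrac{1}{2(1 - \gamma)}} & \text{if } x_1 \sim (\mu_1, \Sigma_1)_{S}, \\[2ex]
\dfrac{2}{3} \sqrt{\dfrac{1}{2(1 - \gamma)}} & \text{if } x_1 \sim (\mu_1, \Sigma_1)_{SU}, \\[2ex]
\Phi^{-1}(\gamma) & \text{if } x_1 \sim \mathcal{G}(\mu_1, \Sigma_1),
\end{cases}
\tag{6}
$$

and $\{w^\star, b^\star\}$ is the optimal solution of (3) and (4).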

Please refer to Appendix A for the proof of Theorem 1. 

We have derived the biased MPM algorithm from a 
different perspective. We reveal that only the assump- 
tion of symmetric distributions is needed to arrive at a 
simple unconstrained formulation. Compared with the 
approach in (Huang et al. 2004), we have used more 
information to simplify the optimization problem. More
importantly, as will be shown in the next section, this 
unconstrained formulation enables us to design a new 
boosting algorithm. 

There is a close connection between our algorithm 
and the linear asymmetric classifier (LAC) in (Wu et al. 
2005). The resulting problem (3) is exactly the same as 
LAC in (Wu et al. 2005). Removing the inequality in 
this constraint leads to a problem solvable by eigen- 
decomposition. We have thus shown that the results of 
Wu et al. may be generalized from the Gaussian dis- 
tributions assumed in (Wu et al. 2005) to symmetric 
distributions. 



3 Linear Asymmetric Classification 

We have shown that starting from the biased minimax 
probability machine, we are able to obtain the same 
optimization formulation as shown in Wu et al. (2005), 
while much weakening the underlying assumption
(symmetric distributions versus Gaussian distributions).
Before we propose our LACBoost and FisherBoost,
however, we provide a brief overview of LAC.

Wu et al. (2008) proposed linear asymmetric classification
(LAC) as a post-processing step for training
nodes in the cascade framework. In (Wu et al. 2008), it
is stated that LAC is guaranteed to reach an optimal
solution under the assumption of Gaussian data
distributions. We now know that this Gaussianity condition
may be relaxed.

Suppose that we have a linear classifier

$$
f(x) = \operatorname{sign}(w^\top x - b).
$$

We seek a $\{w, b\}$ pair with a very high accuracy on the
positive data $x_1$ and a moderate accuracy on the negative
data $x_2$. This can be expressed as the following problem:

$$
\max_{w \neq 0,\, b} \; \Pr\{ w^\top x_1 \ge b \}, \quad
\text{s.t.} \;\; \Pr\{ w^\top x_2 \le b \} = \lambda.
\tag{7}
$$

In (Wu et al. 2005), $\lambda$ is set to 0.5 and it is assumed
that for any $w$, $w^\top x_1$ is Gaussian and $w^\top x_2$ is
symmetric, so that (7) can be approximated by (3). Again, these
assumptions may be relaxed as we have shown in the
last section. Problem (3) is similar to LDA's optimization
problem

$$
\max_{w \neq 0} \; \frac{w^\top (\mu_1 - \mu_2)}{\sqrt{w^\top (\Sigma_1 + \Sigma_2) w}}.
\tag{8}
$$

Problem (3) can be solved by eigen-decomposition and
a closed-form solution can be derived:

$$
w^\star = \Sigma_1^{-1} (\mu_1 - \mu_2), \qquad b^\star = w^{\star\top} \mu_2.
\tag{9}
$$

On the other hand, each node in cascaded boosting classifiers
has the following form:

$$
f(x) = \operatorname{sign}(w^\top H(x) - b).
\tag{10}
$$



We override the symbol H(x) here, which denotes the 
output vector of all weak classifiers over the datum x. 
We can cast each node as a linear classifier over the 
feature space constructed by the binary outputs of all 
weak classifiers. For each node in a cascade classifier, we 
wish to maximize the detection rate while maintaining 
the false positive rate at a moderate level (for example, 
around 50.0%). That is to say, the problem (3) repre- 
sents the node learning goal. Boosting algorithms such 
as AdaBoost can be used as feature selection methods, 
and LAC is used to learn a linear classifier over those bi- 
nary features chosen by boosting as in Wu et al. (2005). 
The advantage of this approach is that LAC considers 
the asymmetric node learning explicitly. 
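As a rough illustration of how LAC and LDA differ when used to re-weight selected features, the sketch below computes both directions on toy data in pure Python. The helper names (`fit_direction`, `solve`), the small ridge term and the toy data are our own assumptions; using the offset $b = w^\top \mu_2$ for LDA as well is also an assumption that simply mirrors the LAC choice.

```python
def mean(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def covariance(rows):
    """Unbiased sample covariance matrix of the row vectors."""
    mu, n = mean(rows), len(rows)
    d = len(mu)
    return [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1)
             for j in range(d)] for i in range(d)]

def solve(M, v):
    """Solve M z = v by Gauss-Jordan elimination with partial pivoting."""
    n = len(v)
    aug = [list(M[i]) + [v[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(n):
            if r != c:
                k = aug[r][c] / aug[c][c]
                aug[r] = [a - k * b for a, b in zip(aug[r], aug[c])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fit_direction(pos, neg, use_negative_cov, ridge=1e-8):
    """LAC direction (use_negative_cov=False) or LDA direction (True)."""
    mu1, mu2 = mean(pos), mean(neg)
    S = covariance(pos)
    if use_negative_cov:  # LDA also uses the negative-class covariance
        S2 = covariance(neg)
        S = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(S, S2)]
    for i in range(len(S)):
        S[i][i] += ridge  # assumed tiny ridge so that S is invertible
    w = solve(S, [a - b for a, b in zip(mu1, mu2)])
    return w, dot(w, mu2)  # offset placed at the projected negative mean

# Toy weak-classifier responses: tight positives, spread-out negatives.
pos = [[2, 0], [3, 1], [4, 0], [3, -1]]
neg = [[0, 0], [1, 2], [-1, -2], [0, 0]]
w_lac, b_lac = fit_direction(pos, neg, use_negative_cov=False)
w_lda, b_lda = fit_direction(pos, neg, use_negative_cov=True)
print(w_lac, b_lac, w_lda, b_lda)
```

On this data the LAC direction ignores the spread of the negatives entirely, while the LDA direction tilts away from the axis along which the negatives vary most.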

However, there is a precondition on the validity of
LAC: for any $w$, $w^\top x_1$ is Gaussian and $w^\top x_2$ is
symmetric. In the case of boosting classifiers, $w^\top x_1$ and
$w^\top x_2$ can be expressed as the margins of positive data
and negative data, respectively. Empirically, Wu et al.
(2008) verified that $w^\top x$ is approximately Gaussian for
a cascade face detector. We discuss this issue in more
detail in Section 5. Shen and Li (2010b) theoretically
proved that under the assumption that weak classifiers
are independent, the margin of AdaBoost follows the
Gaussian distribution, as long as the number of weak
classifiers is sufficiently large. In Section 5 we verify this
theoretical result by performing the normality test on
nodes with different numbers of weak classifiers.
4 Constructing Boosting Algorithms from LDA
and LAC 

In kernel methods, the original data are nonlinearly
mapped to a feature space by a mapping function $\Phi(\cdot)$.
The function need not be known, however, as rather
than being applied to the data directly, it acts instead
through the inner product $\Phi(x_i)^\top \Phi(x_j)$. In boosting
(Ratsch et al. 2002), however, the mapping function
can be seen as being explicitly known, as
$\Phi(x) : x \mapsto [h_1(x), \dots, h_n(x)]$. Let us consider the Fisher LDA case
first because the solution to LDA will generalize to LAC
straightforwardly, by looking at the similarity between
(3) and (8).

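Fisher LDA maximizes the between-class variance
and minimizes the within-class variance. In the binary-class
case, the more general formulation in (8) can be
expressed as

$$
\max_{w} \; \frac{(\mu_1 - \mu_2)^2}{\sigma_1 + \sigma_2}
= \frac{w^\top C_b w}{w^\top C_w w},
\tag{11}
$$

where $C_b$ and $C_w$ are the between-class and within-class
scatter matrices; $\mu_1$ and $\mu_2$ are the projected
centers of the two classes. The above problem can be
equivalently reformulated as

$$
\min_{w} \; w^\top C_w w - \theta (\mu_1 - \mu_2)
\tag{12}
$$

for some certain constant $\theta$ and under the assumption
that $\mu_1 - \mu_2 \ge 0$.^1 Now in the feature space, our data
are $\Phi(x_i)$, $i = 1, \dots, m$. Define the vectors $e, e_1, e_2 \in \mathbb{R}^m$
such that $e = e_1 + e_2$, the $i$-th entry of $e_1$ is $1/m_1$ if
$y_i = +1$ and 0 otherwise, and the $i$-th entry of $e_2$ is
$1/m_2$ if $y_i = -1$ and 0 otherwise. We then see that

$$
\mu_1 = \frac{1}{m_1} w^\top \sum_{y_i = 1} \Phi(x_i)
      = \frac{1}{m_1} \sum_{y_i = 1} A_{i:} w
      = \frac{1}{m_1} \sum_{y_i = 1} (Aw)_i
      = e_1^\top A w,
\tag{13}
$$

and

$$
\mu_2 = \frac{1}{m_2} w^\top \sum_{y_i = -1} \Phi(x_i)
      = \frac{1}{m_2} \sum_{y_i = -1} H_{i:} w
      = - e_2^\top A w.
\tag{14}
$$

^1 In our object detection experiment, we found that this
assumption can always be satisfied.

For ease of exposition we order the training data
according to their labels, so the vector $e \in \mathbb{R}^m$ is

$$
e = [\, 1/m_1, \cdots, 1/m_2, \cdots \,]^\top,
\tag{15}
$$

and the first $m_1$ components of $\rho$ correspond to the
positive training data and the remaining ones correspond
to the $m_2$ negative data. We now see that $\mu_1 - \mu_2 = e^\top \rho$ and
$C_w = m_1/m \cdot \Sigma_1 + m_2/m \cdot \Sigma_2$, with $\Sigma_{1,2}$ the
covariance matrices. Noting that

$$
w^\top \Sigma_{1,2} w = \frac{1}{m_{1,2}(m_{1,2} - 1)}
\sum_{i > k,\; y_i = y_k = \pm 1} (\rho_i - \rho_k)^2,
$$

we can easily rewrite the original problem (11) (and
(12)) into:

$$
\min_{w, \rho} \; \tfrac{1}{2} \rho^\top Q \rho - \theta e^\top \rho,
\quad \text{s.t.} \;\; w \succcurlyeq 0, \;\; \mathbf{1}^\top w = 1, \;\;
\rho_i = (Aw)_i, \; i = 1, \cdots, m.
\tag{16}
$$

Here $Q = \begin{bmatrix} Q_1 & 0 \\ 0 & Q_2 \end{bmatrix}$ is a block matrix with

$$
Q_1 = \begin{bmatrix}
\frac{1}{m} & -\frac{1}{m(m_1 - 1)} & \cdots & -\frac{1}{m(m_1 - 1)} \\
-\frac{1}{m(m_1 - 1)} & \frac{1}{m} & \cdots & -\frac{1}{m(m_1 - 1)} \\
\vdots & \vdots & \ddots & \vdots \\
-\frac{1}{m(m_1 - 1)} & -\frac{1}{m(m_1 - 1)} & \cdots & \frac{1}{m}
\end{bmatrix},
$$

and $Q_2$ is similarly defined by replacing $m_1$ with $m_2$ in $Q_1$:

$$
Q_2 = \begin{bmatrix}
\frac{1}{m} & -\frac{1}{m(m_2 - 1)} & \cdots & -\frac{1}{m(m_2 - 1)} \\
-\frac{1}{m(m_2 - 1)} & \frac{1}{m} & \cdots & -\frac{1}{m(m_2 - 1)} \\
\vdots & \vdots & \ddots & \vdots \\
-\frac{1}{m(m_2 - 1)} & -\frac{1}{m(m_2 - 1)} & \cdots & \frac{1}{m}
\end{bmatrix}.
$$

Also note that we have introduced a constant $\frac{1}{2}$ before
the quadratic term for convenience. The normalization
constraint $\mathbf{1}^\top w = 1$ removes the scale ambiguity of $w$.
Without it the problem is ill-posed.

We see from the form of (3) that the covariance of
the negative data is not involved in LAC and thus that if
we set $Q = \begin{bmatrix} Q_1 & 0 \\ 0 & 0 \end{bmatrix}$ then (16) becomes the optimization
problem of LAC.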
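The structure of $Q$ can be checked numerically. The sketch below is an illustration under our own naming (`build_Q`, `quad_form`, and the toy margins are hypothetical): it builds $Q$ for the LDA and LAC variants and confirms that $\rho^\top Q \rho$ equals the weighted within-class variance of the margins.

```python
def block(m_k, m):
    """m_k x m_k block with 1/m on the diagonal and -1/(m (m_k - 1)) elsewhere."""
    off = -1.0 / (m * (m_k - 1))
    return [[1.0 / m if i == j else off for j in range(m_k)] for i in range(m_k)]

def build_Q(m1, m2, lac=False):
    """Block-diagonal Q; for LAC the negative-class block is set to zero."""
    m = m1 + m2
    Q = [[0.0] * m for _ in range(m)]
    for i, row in enumerate(block(m1, m)):
        Q[i][:m1] = row
    if not lac:
        for i, row in enumerate(block(m2, m)):
            Q[m1 + i][m1:] = row
    return Q

def quad_form(Q, rho):
    n = len(rho)
    return sum(rho[i] * Q[i][j] * rho[j] for i in range(n) for j in range(n))

def sample_var(v):
    mu = sum(v) / len(v)
    return sum((x - mu) ** 2 for x in v) / (len(v) - 1)

# Hypothetical margins, ordered by label: first m1 positives, then m2 negatives.
rho = [1.0, 0.5, 0.0, -0.5, -1.0]
m1, m2 = 3, 2
m = m1 + m2

lda_value = quad_form(build_Q(m1, m2), rho)
lac_value = quad_form(build_Q(m1, m2, lac=True), rho)
expected = m1 / m * sample_var(rho[:m1]) + m2 / m * sample_var(rho[m1:])
row_sums = [sum(row) for row in build_Q(m1, m2)]
print(lda_value, lac_value, expected, row_sums)
```

The zero row sums explain why $Q$ is singular: adding a constant to all margins of one class leaves the within-class variance unchanged.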

At this stage, it remains unclear how to solve
problem (16) because we do not know all the weak
classifiers. There may be extremely (or even infinitely)
many weak classifiers in $\mathcal{H}$, the set from which $h(\cdot)$ is
selected, meaning that the dimension of the optimization
variable $w$ may also be extremely large. So (16)
is a semi-infinite quadratic program (SIQP). We show
how column generation can be used to solve this problem.
To make column generation applicable, we need to
derive a specific Lagrange dual of the primal problem.
4.1 The Lagrange Dual Problem

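We now derive the Lagrange dual of the quadratic problem
(16). Although we are only interested in the variable
$w$, we need to keep the auxiliary variable $\rho$ in order
to obtain a meaningful dual problem. The Lagrangian
of (16) is

$$
L(\underbrace{w, \rho}_{\text{primal}}, \underbrace{u, r}_{\text{dual}})
= \tfrac{1}{2} \rho^\top Q \rho - \theta e^\top \rho + u^\top (\rho - Aw) - q^\top w
+ r (\mathbf{1}^\top w - 1),
\tag{17}
$$

with $q \succcurlyeq 0$. $\sup_{u, r} \inf_{w, \rho} L(w, \rho, u, r)$ gives the following
Lagrange dual:

$$
\max_{u, r} \; -r - \underbrace{\tfrac{1}{2} (u - \theta e)^\top Q^{-1} (u - \theta e)}_{\text{regularization}},
\quad \text{s.t.} \;\; \sum_{i=1}^{m} u_i A_{i:} \preccurlyeq r \mathbf{1}^\top.
\tag{18}
$$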
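The step from the Lagrangian to its dual can be sketched as follows. This is our own reconstruction of the intermediate algebra, written under the assumption that $Q$ is symmetric and invertible (which holds once $Q$ is regularized); it is meant as a reading aid rather than as the authors' derivation.

```latex
% Stationarity of the Lagrangian in the primal variables rho and w
\begin{align*}
\nabla_{\rho} L = Q\rho - \theta e + u = 0
  &\;\Longrightarrow\; \rho = -Q^{-1}(u - \theta e), \\
\nabla_{w} L = -A^\top u - q + r\mathbf{1} = 0
  &\;\Longrightarrow\; A^\top u = r\mathbf{1} - q \preccurlyeq r\mathbf{1}
  \quad (\text{since } q \succcurlyeq 0).
\end{align*}
% Substituting back: all terms in w cancel and the rho-terms collapse
\begin{align*}
\inf_{w,\rho} L
 &= \tfrac{1}{2}\rho^\top Q \rho + (u - \theta e)^\top \rho - r \\
 &= \tfrac{1}{2}(u - \theta e)^\top Q^{-1}(u - \theta e)
    - (u - \theta e)^\top Q^{-1}(u - \theta e) - r \\
 &= -r - \tfrac{1}{2}(u - \theta e)^\top Q^{-1}(u - \theta e).
\end{align*}
```

The constraint $A^\top u \preccurlyeq r\mathbf{1}$ is the same statement as $\sum_i u_i A_{i:} \preccurlyeq r\mathbf{1}^\top$, written column-wise: there is one inequality per weak classifier, which is what makes column generation applicable.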

In our case, $Q$ is rank-deficient and its inverse does
not exist (for both LDA and LAC). In fact, both
$Q_1$ and $Q_2$ have a zero eigenvalue with the corresponding
eigenvector being all ones. This is easy to
see because for $Q_1$ and $Q_2$, the sum of each row (or
each column) is zero. We can simply regularize $Q$ with
$Q + \delta I$, with $\delta$ a small positive constant. Note that $Q$ is
diagonally dominant but not strictly so. Hence $Q + \delta I$ with
any $\delta > 0$ is strictly diagonally dominant, and by the
Gershgorin circle theorem, a strictly diagonally dominant
matrix must be invertible.

One of the KKT optimality conditions between the
dual and primal is

$$
\rho^\star = -Q^{-1} (u^\star - \theta e),
\tag{19}
$$

which can be used to establish the connection between
the dual optimum and the primal optimum. This is obtained
from the fact that the gradient of $L$ w.r.t. $\rho$ must
vanish at the optimum, $\partial L / \partial \rho_i = 0$, $\forall i = 1, \cdots, m$.

Problem (18) can be viewed as a regularized LP- 
Boost problem. Compared with the hard-margin LP- 
Boost (Demiriz et al. 2002), the only difference is the 
regularization term in the cost function. The duality 
gap between the primal (16) and the dual (18) is zero. 
In other words, the solutions of (16) and (18) coincide. 
Instead of solving (16) directly, one calculates the most 
violated constraint in (18) iteratively for the current 
solution and adds this constraint to the optimization 
problem. In theory, any column that violates dual fea- 
sibility can be added. To speed up the convergence, we 



add the most violated constraint by solving the follow- 
ing problem: 

$$h'(\cdot) = \arg\max_{h(\cdot)} \sum_{i=1}^{m} u_i y_i h(x_i). \qquad (20)$$

This is exactly the same subproblem that standard AdaBoost and LPBoost solve to
produce the best weak classifier at each iteration, that is, to find the weak
classifier that has the minimum weighted training error. We summarize the
LACBoost/FisherBoost algorithm in Algorithm 1. By simply changing Q_2,
Algorithm 1 can be used to train either LACBoost or FisherBoost. Note that to
obtain an actual strong classifier, one may need to include an offset b, i.e.,
the final classifier is $\sum_{j=1}^{n} w_j h_j(x) - b$. This is because the
cost function of our algorithm (12) does not itself minimize any classification
error; it only finds a projection direction in which the data can be maximally
separated. A simple line search can find an optimal b. Moreover, when training
a cascade, we need to tune this offset anyway, as shown in (10).
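The subproblem (20) can be illustrated with an exhaustive search over decision stumps. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `best_stump`, the threshold candidates and the tie-breaking rule are all hypothetical choices made for illustration.

```python
def best_stump(X, y, u):
    """Return (edge, feature, threshold, polarity) maximizing sum_i u_i * y_i * h(x_i).

    The stump is h(x) = polarity if x[feature] > threshold else -polarity.
    When the weights u sum to one, edge = 1 - 2 * (weighted training error),
    so maximizing the edge is the same as minimizing the weighted error.
    """
    m, d = len(X), len(X[0])
    best = None
    for feature in range(d):
        values = sorted(set(row[feature] for row in X))
        # Candidate thresholds: one below the minimum, then midpoints of distinct values.
        thresholds = [values[0] - 1.0]
        thresholds += [(a + b) / 2.0 for a, b in zip(values, values[1:])]
        for threshold in thresholds:
            edge = sum(u[i] * y[i] * (1.0 if X[i][feature] > threshold else -1.0)
                       for i in range(m))
            # Flipping the polarity simply negates the edge.
            for polarity, signed_edge in ((1, edge), (-1, -edge)):
                if best is None or signed_edge > best[0]:
                    best = (signed_edge, feature, threshold, polarity)
    return best


X = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 0.0]]
y = [-1, -1, 1, 1]
u = [0.25, 0.25, 0.25, 0.25]
print(best_stump(X, y, u))
```

In the toy data the first feature separates the two classes, so the selected stump thresholds that feature. A practical implementation would sort each feature once rather than rescanning the data for every threshold.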

The convergence of Algorithm 1 is guaranteed by 
general column generation or cutting-plane algorithms, 
which is easy to establish: 

Theorem 2 The column generation procedure decreases 
the objective value of problem (16) at each iteration and 
hence in the limit it solves the problem (16) globally to 
a desired accuracy. 

The proof is deferred to Appendix B. In short, when a new h'(·) that violates
dual feasibility is added, the new optimal value of the dual problem
(maximization) would decrease. Accordingly, the optimal value of its primal
problem decreases too because they have the same optimal value due to zero
duality gap. Moreover, the primal cost function is convex; therefore, in the
end it converges to the global minimum.

At each iteration of column generation, in theory, 
we can solve either the dual (18) or the primal problem 
(16). Here we choose to solve an equivalent variant of 
the primal problem (16): 

$$\min_{w} \; \tfrac{1}{2} w^\top (A^\top Q A) w - (\theta e^\top A) w, \qquad \text{s.t.} \;\; w \in \Delta_n, \qquad (21)$$

where Δ_n is the unit simplex, which is defined as
{w ∈ R^n : 1^T w = 1, w ≥ 0}.

Algorithm 1 Column generation for SIQP.
  Input: Labeled training data (x_i, y_i), i = 1 ... m; termination threshold
         ε > 0; regularization parameter θ; maximum number of iterations n_max.
1 Initialization: n = 0; w = 0; and u_i = 1/m, i = 1 ... m.
2 for iteration = 1 : n_max do
3   - Check for the optimality:
      if iteration > 1 and $\sum_{i=1}^{m} u_i y_i h'(x_i) < r + \varepsilon$,
      then break; and the problem is solved;
4   - Add h'(·) to the restricted master problem, which corresponds to a new
      constraint in the dual;
5   - Solve the dual problem (18) (or the primal problem (16)) and update r
      and u_i (i = 1 ... m).
6   - Increment the number of weak classifiers n = n + 1.
  Output: The selected features are h_1, h_2, ..., h_n. The final strong
          classifier is: $F(x) = \sum_{j=1}^{n} w_j h_j(x) - b$. Here the
          offset b can be learned by a simple line search.

In practice, it could be much faster to solve (21) since

1. Generally, the primal problem has a smaller size, hence faster to solve.
   The number of variables of (18) is m at each iteration, while the number of
   variables is the number of iterations for the primal problem. For example,
   in Viola-Jones' face detection framework, the number of training data
   m = 10,000 and n_max = 200. In other words, the primal problem has at most
   200 variables in this case;
2. The dual problem (18) is a standard QP problem. It has no special structure
   to exploit. As we will show, the primal problem (21) belongs to a special
   class of problems and can be efficiently solved using entropic/exponentiated
   gradient descent (EG) (Beck and Teboulle 2003; Collins et al. 2008). See
   Appendix C for details of the EG algorithm.
   A fast QP solver is extremely important for training our object detector
   since we need to solve a few thousand QP problems. Compared with standard
   QP solvers like Mosek (MOSEK 2010), EG is much faster. EG makes it possible
   to train a detector using almost the same amount of time as using standard
   AdaBoost because the majority of time is spent on weak classifier training
   and bootstrapping.
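To make the EG idea concrete, the sketch below applies a plain exponentiated gradient update to a small simplex-constrained QP of the same form as (21), with H standing for A^T Q A and f for θ A^T e. It is only an illustration: the fixed step size `eta`, the stopping rule and the function name are assumptions made here, and the authors' actual procedure is the one described in their Appendix C.

```python
import math


def eg_simplex_qp(H, f, eta=0.5, max_iters=2000, tol=1e-10, w0=None):
    """Minimize 0.5 * w^T H w - f^T w over the unit simplex by exponentiated gradient.

    H is a symmetric positive semidefinite matrix given as a list of rows.
    w0 is an optional strictly positive starting point on the simplex.
    """
    n = len(f)
    w = list(w0) if w0 is not None else [1.0 / n] * n
    for _ in range(max_iters):
        grad = [sum(H[j][k] * w[k] for k in range(n)) - f[j] for j in range(n)]
        # Subtracting the smallest gradient entry keeps the exponents non-positive;
        # the shift cancels in the normalization below.
        g_min = min(grad)
        unnormalized = [w[j] * math.exp(-eta * (grad[j] - g_min)) for j in range(n)]
        total = sum(unnormalized)
        new_w = [v / total for v in unnormalized]
        change = sum(abs(a - b) for a, b in zip(new_w, w))
        w = new_w
        if change < tol:
            break
    return w


H = [[2.0, 0.0], [0.0, 1.0]]
f = [0.0, 0.0]
w = eg_simplex_qp(H, f)
print(w)  # approaches the point where both gradient entries are equal
```

The multiplicative update keeps every iterate non-negative and the normalization keeps it on the simplex, so no explicit projection step is needed. The optional `w0` argument allows starting from a previous solution.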

We can recover both of the dual variables u*, r* easily from the primal
variables w*, ρ*:

$$u^\star = -Q\rho^\star + \theta e; \qquad (22)$$
$$r^\star = \max_{j = 1, \dots, n} \Big\{ \sum_{i=1}^{m} u_i^\star A_{ij} \Big\}. \qquad (23)$$

The first equation is a rearrangement of the KKT condition (19). The second
equation is obtained from the fact that, among the dual problem's constraints,
at least one must hold with equality at the optimum. That is to say, r* is the
largest edge over all weak classifiers.

In summary, when using EG to solve the primal problem, Line 5 of Algorithm 1
is:

  - Solve the primal problem (21) using EG, and update the dual variables u
    with (22), and r with (23).

Fig. 2: Decision boundaries of AdaBoost (top) and FisherBoost (bottom) on 2D
artificial data generated from the Gaussian distribution (positive data
represented by □'s and negative data by ×'s). Weak classifiers are vertical and
horizontal decision stumps. FisherBoost places more emphasis on positive
samples than on negative samples. As a result, the decision boundary of
FisherBoost is more similar to the Gaussian distribution than the decision
boundary of AdaBoost.

5 Experiments

In this section, we perform our experiments on both synthetic and challenging
real-world data sets, e.g., face and pedestrian detection.



5.1 Synthetic Testing 

We first illustrate the performance of FisherBoost on an asymmetrical synthetic
data set where there are a large number of negative samples compared to the
positive ones. Fig. 2 demonstrates the subtle difference in classification
boundaries between AdaBoost and FisherBoost. It can be observed that
FisherBoost places more emphasis on positive samples than on negative samples
to ensure these positive samples would be classified correctly. AdaBoost, on
the other hand, treats both positive and negative samples equally. This might
be due to the fact that AdaBoost only optimizes the overall classification
accuracy. This finding is consistent with our results reported earlier in
(Paisitkriangkrai et al. 2009; Shen et al. 2011).

5.2 Comparison With Other Asymmetric Boosting 

In this experiment, FisherBoost and LACBoost are compared against several
asymmetric boosting algorithms, namely, AdaBoost with LAC or Fisher LDA
post-processing (Wu et al. 2008), AsymBoost (Viola and Jones 2002),
cost-sensitive AdaBoost (CS-ADA) (Masnadi-Shirazi and Vasconcelos 2011) and
rate constrained boosting (RCBoost) (Saberian and Vasconcelos 2012). The
results of AdaBoost are also presented as the baseline. For each algorithm, we
train a strong classifier consisting of 100 weak classifiers along with their
coefficients. The threshold is determined such that the false positive rate on
the test set is 50%. For every method, the experiment is repeated 5 times and
the average detection rate on the positive class is reported. For FisherBoost
and LACBoost, the parameter θ is chosen from {1/10, 1/12, 1/15, 1/20} by
cross-validation. For AsymBoost, we choose k (asymmetric factor) from
{2^{-0.5}, 2^{-0.4}, ..., 2^{0.5}} by cross-validation. For CS-ADA, we set the
cost for misclassifying positive and negative data as follows. We assign the
asymmetric factor k = C_1/C_2 and restrict 0.5(C_1 + C_2) = 1. We choose k from
{1.2, 1.65, 2.1, 2.55, 3} by cross-validation. For RCBoost, we conduct two
experiments. In the first experiment, we use the same training set to enforce
the target detection rate, while in the second experiment, we use 75% of the
training data to train the model and the other 25% to enforce the target
detection rate. We set the target detection rate, D_T, to 99.5%, the barrier
coefficient, γ, to 2 and the number of iterations before halving γ, N_d, to
10.
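The evaluation rule used here (fix the false positive rate at 50%, then report the detection rate) can be sketched as below. The function name and the handling of tied scores are assumptions made for illustration; the text does not specify how ties are resolved.

```python
def detection_rate_at_fpr(pos_scores, neg_scores, target_fpr=0.5):
    """Choose a threshold so that at most target_fpr of the negatives score above it.

    Returns (threshold, detection_rate), where a sample is accepted when its
    score is strictly greater than the threshold.
    """
    neg_sorted = sorted(neg_scores, reverse=True)
    allowed = int(target_fpr * len(neg_sorted))  # negatives allowed to pass
    if allowed < len(neg_sorted):
        # The (allowed + 1)-th highest negative score becomes the threshold.
        threshold = neg_sorted[allowed]
    else:
        threshold = float("-inf")
    detected = sum(1 for s in pos_scores if s > threshold)
    return threshold, detected / len(pos_scores)


pos = [0.25, 0.5, 0.15, 0.9]
neg = [0.1, 0.2, 0.3, 0.4]
print(detection_rate_at_fpr(pos, neg, target_fpr=0.5))
```

With tied negative scores, fewer negatives than allowed may pass, so the realized false positive rate is at most the target rather than exactly equal to it.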

We tested the performance of all algorithms on five real-world data sets,
including both machine learning (USPS) and vision data sets (cars, faces,
pedestrians, scenes). We categorized the USPS data set into two classes: even
digits and odd digits. For faces, we use face data sets from (Viola and Jones
2004) and randomly extract 5000 negative patches from background images. We
apply principal component analysis (PCA) to preserve 95% of the total
variation. The new data set has a dimension of 93. For UIUC car (Agarwal et al.
2004), we downsize the original image from 40 × 100 pixels to 20 × 50 pixels
and apply PCA. The projected data capture 95% of the total variation and have a
final dimension of 228. For the Daimler-Chrysler pedestrian data sets (Munder
and Gavrila 2006), we apply PCA to the original 18 × 36 pixels. The projected
data capture 95% of the variation and have a final dimension of 139. For
indoor/outdoor scene, we divide the 15-scene data set used in (Lazebnik et al.
2006) into 2 groups: indoor and outdoor scenes. We use CENTRIST as our feature
descriptors and build 50 visual code words using the histogram intersection
kernel (Wu and Rehg 2011). Each image is represented in a spatial hierarchy
manner. Each image consists of 31 sub-windows. In total, there are 1550 feature
dimensions per image. All 5 classifiers are trained to remove 50% of the
negative data, while retaining almost all positive data. We compare their
detection rates in Table 1. From our experiments, FisherBoost demonstrates the
best performance on most data sets. However, LACBoost does not perform as well
as expected. We suspect that the poor performance might be partially due to
numerical issues, which can cause overfitting. We will discuss this in more
detail in Section 5.6.



5.3 Face Detection Using a Cascade Classifier 

In this experiment, eight asymmetric boosting methods are evaluated with the
multi-exit cascade (Pham et al. 2008): FisherBoost/LACBoost, AdaBoost alone or
with LDA/LAC post-processing (Wu et al. 2008), and AsymBoost alone or with
LDA/LAC post-processing. We have also implemented Viola-Jones' face detector
(AdaBoost with the conventional cascade) as the baseline (Viola and Jones
2004). Furthermore, our face detector is also compared with state-of-the-art
methods, including some cascade design methods, i.e., WaldBoost (Sochman and
Matas 2005), FloatBoost (Li and Zhang 2004), Boosting Chain (Xiao et al. 2003)
and the extension of (Saberian and Vasconcelos 2010), RCECBoost (Saberian and
Vasconcelos 2012). The algorithm for training a multi-exit cascade is
summarized in Algorithm 2.

We first illustrate the validity of adopting LAC and Fisher LDA post-processing
to improve the node learning objective in the cascade classifier. As described
above, LAC and LDA assume that the margin of the training data associated with
the node classifier in such a cascade exhibits a Gaussian distribution. We
demonstrate this assumption on the face detection task in Fig. 3, which shows
the normal probability plot of the margins of the positive training data for
the first three node classifiers in the multi-exit LAC classifier. The figure
reveals that the larger the number of weak classifiers used, the more closely
the margins follow the Gaussian
             AdaBoost      LAC           FLDA          AsymBoost     CS-ADA        RCBoost¹      RCBoost²      LACBoost      FisherBoost
Digits       99.30 (0.10)  99.30 (0.21)  99.37 (0.08)  99.40 (0.11)  99.37 (0.09)  99.36 (0.17)  99.27 (0.15)  99.12 (0.07)  99.40 (0.13)
Faces        98.70 (0.14)  98.78 (0.42)  98.86 (0.22)  98.73 (0.14)  98.71 (0.20)  98.75 (0.18)  98.66 (0.23)  98.63 (0.29)  98.89 (0.15)
Cars         97.02 (1.55)  97.07 (1.34)  97.02 (1.50)  97.11 (1.36)  97.47 (1.31)  96.84 (0.87)  96.62 (1.08)  96.80 (1.47)  97.78 (1.27)
Pedestrians  98.54 (0.34)  98.59 (0.71)  98.69 (0.28)  98.55 (0.45)  98.51 (0.36)  98.67 (0.29)  98.65 (0.39)  99.12 (0.35)  98.73 (0.33)
Scenes       99.59 (0.10)  99.54 (0.21)  99.57 (0.12)  99.66 (0.12)  99.68 (0.10)  99.61 (0.19)  99.62 (0.16)  97.50 (1.07)  99.66 (0.10)
Average      98.63         98.66         98.70         98.69         98.75         98.64         98.56         98.23         98.89

Table 1: Test errors (%) on five real-world data sets. All experiments are run 5 times with 100 boosting iterations.
The average detection rate and standard deviation (in percentage) at 50% false positives are reported. Best average
detection rate is shown in boldface.




Fig. 3: Normality test (normal probability plot) for the face data's margin distribution of nodes 1, 2, 3
(horizontal axis: margin). The 3 nodes contain 7, 22, 52 weak classifiers respectively. The data are plotted
against a theoretical normal distribution such that data following the normal distribution model should form a
straight line. Curves deviating from the straight line (the red line) indicate departures from normality. The
larger the number of weak classifiers, the more closely the margins follow the Gaussian distribution.



distribution. From this, we infer that LAC/LDA post-processing, and thus
LACBoost and FisherBoost, can be expected to achieve a better performance when
a larger number of weak classifiers are used. We therefore apply LAC/LDA only
within the later nodes (for example, 9 onwards) of a multi-exit cascade as
these nodes contain more weak classifiers. We choose multi-exit due to its
property^ and effectiveness as reported in (Pham et al. 2008). We have compared
the multi-exit cascade with LDA/LAC post-processing against the conventional
cascade with LDA/LAC post-processing in (Wu et al. 2008) and performance
improvement has been observed.

As in (Wu et al. 2008), five basic types of Haar-like features are calculated,
resulting in a 162,336-dimensional over-complete feature set on an image of
24 × 24 pixels. To speed up the weak classifier training, as in (Wu et al.
2008), we uniformly sample 10% of features for training weak classifiers
(decision stumps). The face data set consists of 9,832 mirrored 24 × 24 images
(Viola and Jones 2004) (5,000 images used for training and 4,832 images used
for validation) and 7,323 larger resolution background images, as used in (Wu
et al. 2008).

^ Since the multi-exit cascade makes use of all previous 
weak classifiers in earlier nodes, it would meet the Gaussianity 
requirement better than the conventional cascade classifier. 



Several multi-exit cascades are trained with the various algorithms described
above. In order to ensure a fair comparison, we have used the same number of
multi-exit stages and the same number of weak classifiers. Each multi-exit
cascade consists of 22 exits and 2,923 weak classifiers. The indices of exit
nodes are predetermined to simplify the training procedure.

For our FisherBoost and LACBoost, we have an important parameter θ, which is
chosen from a predefined set of candidate values. We have not carefully tuned
this parameter using cross-validation. Instead, we train a 10-node cascade for
each candidate θ, and choose the one with the best training accuracy.^ At each
exit, negative examples correctly rejected by the current cascade are
discarded, and new negative examples are bootstrapped from the background
images pool. In total, billions of negative examples are extracted from the
pool. The positive training data and validation data remain unchanged during
the training process.

Our experiments are performed on a workstation 
with 8 Intel Xeon E5520 CPUs and 32GB RAM. It 
takes about 3 hours to train the multi-exit cascade with 
AdaBoost or AsymBoost. For FisherBoost and LAC- 
Boost, it takes less than 4 hours to train a complete 

^ To train a complete 22-node cascade and choose the best θ on
cross-validation data may give better detection rates.
Fig. 4: Our face detectors are compared with other asymmetric boosting methods (a) and some state-of-the-art
methods, including cascade design methods, (b) on the MIT+CMU frontal face test data using ROC curves (number of
false positives versus detection rate). "Ada" and "Asym" mean that features are selected using AdaBoost and
AsymBoost, respectively. "VJ" implements Viola and Jones' cascade using AdaBoost (Viola and Jones 2004).
"MultiExit" means the multi-exit cascade (Pham et al. 2008). The ROC curves of compared methods in (b) are
quoted from their original papers (Sochman and Matas 2005; Li and Zhang 2004; Xiao et al. 2003; Saberian and
Vasconcelos 2012). Compared methods are ranked in the legend, based on the average of detection rates.



multi-exit cascade.^ In other words, our EG algorithm takes less than 1 hour to
solve the primal QP problem (we need to solve a QP at each iteration). As an
estimation of the computational complexity, suppose that the number of training
examples is m and the number of weak classifiers is n. At each iteration of the
cascade training, the complexity of solving the primal QP using EG is
O(mn + kn²), with k the number of iterations needed for EG's convergence. The
complexity of training the weak classifier is O(md), with d the number of all
Haar-feature patterns. In our experiment, m = 10,000, n ≈ 2900, d = 160,000,
k < 500. So the majority of the computational cost of the training process is
bound up in the weak classifier training.

We have also experimentally observed the speedup of EG against standard QP
solvers. We solve the primal QP defined by (21) using EG and Mosek (MOSEK
2010). The QP's size is 1,000 variables. With the same accuracy tolerance
(Mosek's primal-dual gap and EG's convergence tolerance are set to the same
value), Mosek takes 1.22 seconds and EG takes 0.0541 seconds on a standard
desktop. So EG is about 20 times faster. Moreover, at iteration n + 1 of
training the cascade, EG can take advantage of the last iteration's solution by
starting EG from a small perturbation of the previous solution. Such a
warm-start gains a 5 to 10× speedup in our experiment, while the current QP
solver in Mosek does not support warm-start (MOSEK 2010, Chapter 7).

^ Our implementation is in C++ and only the weak classifier training part is
parallelized using OpenMP.

We evaluate the detection performance on the MIT+CMU frontal face test set.
This data set is made up of 507 frontal faces in 130 images with different
backgrounds.

If one positive output has less than 50% variation of shift and scale from the
ground-truth, we treat it as a true positive, otherwise a false positive.

In the test phase, the scale factor of the scanning window is set to 1.2 and
the stride step is set to 1 pixel.

The receiver operating characteristic (ROC) curves in Fig. 4 show the entire
cascade's performance. The average detection rate (similar to the one used in
(Dollár et al. 2012)), which is the mean of detection rates sampled evenly from
50 to 200 false positives, is used to rank the compared methods. Note, however,
that multiple factors impact the cascade's performance, including the
classifier set, the cascade structure, bootstrapping, etc. Fig. 4 (a)
demonstrates the superior performance of FisherBoost to other asymmetric
boosting methods in the face detection task. We can also see that LACBoost
performs worse than FisherBoost. Wu et al. have observed that LAC
post-processing does not outperform LDA post-processing in some cases either.
Algorithm 2 The procedure for training a multi-exit cascade with LACBoost or
FisherBoost.

Input:
  - A training set with m examples, which are ordered by their labels (m_1
    positive examples followed by m_2 negative examples);
  - d_min: minimum acceptable detection rate per node;
  - f_max: maximum acceptable false positive rate per node;
  - F_fp: target overall false positive rate.
Initialize:
  t = 0; (node index)
  n = 0; (total selected weak classifiers up to the current node)
  D_t = 1; F_t = 1. (overall detection rate and false positive rate up to the
  current node)
while F_fp < F_t do
  t = t + 1; (increment node index)
  while d_t < d_min do
    (current detection rate d_t is not acceptable yet)
    - n = n + 1, and generate a weak classifier and update all the weak
      classifiers' linear coefficients using LACBoost or FisherBoost.
    - Adjust threshold b of the current boosted strong classifier

        $$F^t(x) = \sum_{j=1}^{n} w_j^t h_j(x) - b$$

      such that f_t ≈ f_max.
    - Update the detection rate of the current node d_t with the learned
      boosted classifier.
  Update D_{t+1} = D_t × d_t; F_{t+1} = F_t × f_t.
  Remove correctly classified negative samples from the negative training set.
  if F_fp < F_t then
    Evaluate the current cascaded classifier on the negative images and add
    misclassified samples into the negative training set; (bootstrap)
Output: A multi-exit cascade classifier with n weak classifiers and t nodes.
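At test time, a multi-exit cascade of the kind produced by Algorithm 2 can be evaluated roughly as sketched below. This is an illustration under assumptions: the data layout (one coefficient vector and one threshold per exit) and the function name are hypothetical, and the key point shown is only that each exit reuses the responses of all weak classifiers selected so far.

```python
def evaluate_multi_exit_cascade(x, weak_classifiers, exits):
    """Run x through a multi-exit cascade.

    weak_classifiers: list of callables h_j(x), in the order they were selected.
    exits: list of (weights_t, b_t); weights_t holds the coefficients w_j^t of
           the first len(weights_t) weak classifiers and b_t is the threshold.
    Returns (accepted, number_of_weak_classifiers_evaluated).
    """
    responses = []  # cached h_j(x); shared by all later exits
    for weights_t, b_t in exits:
        # Only the weak classifiers added since the previous exit are evaluated.
        while len(responses) < len(weights_t):
            responses.append(weak_classifiers[len(responses)](x))
        score = sum(w * h for w, h in zip(weights_t, responses))
        if score < b_t:
            return False, len(responses)  # rejected at this exit
    return True, len(responses)


weak = [
    lambda x: 1.0 if x[0] > 0 else -1.0,
    lambda x: 1.0 if x[1] > 0 else -1.0,
    lambda x: 1.0 if x[0] + x[1] > 1 else -1.0,
]
exits = [([1.0], 0.0), ([0.5, 0.5, 1.0], 0.5)]
print(evaluate_multi_exit_cascade((1.0, 1.0), weak, exits))
print(evaluate_multi_exit_cascade((-1.0, 1.0), weak, exits))
```

Most windows are rejected at early exits after only a few weak classifiers, which is where the speed of the cascade comes from.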



We have also compared our methods with the boosted greedy sparse LDA (BGSLDA)
in (Paisitkriangkrai et al. 2009; Shen et al. 2011), which is considered one of
the state-of-the-art methods. FisherBoost and LACBoost outperform BGSLDA with
AdaBoost/AsymBoost in the detection rate. Note that BGSLDA uses the standard
cascade.

From Fig. 4 (b), we can see the performance of FisherBoost is better than the
other considered cascade design methods. However, since the parameters of the
cascade structure (e.g., node thresholds, number of nodes, number of weak
classifiers per node) are not carefully tuned, our method cannot guarantee an
optimal trade-off between accuracy and speed. We believe that the boosting
method and the cascade design strategy complement each other. Actually in
(Saberian and Vasconcelos 2010), the authors also incorporate some
cost-sensitive boosting algorithms, e.g., cost-sensitive AdaBoost
(Masnadi-Shirazi and Vasconcelos 2011), AsymBoost (Viola and Jones 2002), with
their cascade design method.



5.4 Pedestrian Detection Using a Cascade Classifier 

We run our experiments on pedestrian detection with a minor modification to the
visual features being used. We evaluate our approach on the INRIA data set
(Dalal and Triggs 2005). The training set consists of 2,416 cropped mirrored
pedestrian images and 1,200 large resolution background images. The test set
consists of 288 images containing 588 annotated pedestrians and 453
non-pedestrian images. Each training sample is scaled to 64 × 128 pixels with
an additional 16 pixels added to each border to preserve human contour
information. During testing, the detection scanning window is resized to
32 × 96 pixels to fit the human body. We use histogram of oriented gradient
(HOG) features in our experiments. Instead of using fixed-size blocks (105
blocks of size 16 × 16 pixels) as in Dalal and Triggs (Dalal and Triggs 2005),
we define blocks with various scales (from 12 × 12 pixels to 64 × 128 pixels)
and width-length ratios (1 : 1, 1 : 2, 2 : 1, 1 : 3, and 3 : 1). Each block is
divided into 2 × 2 cells, and HOG features in each cell are summarized into 9
bins. Hence a 36-dimensional HOG feature is generated from each block. In
total, there are 7,735 blocks from a 64 × 128-pixel patch. ℓ1-norm
normalization is then applied to the feature vector. Furthermore, we use
integral histograms to speed up the computation as in (Zhu et al. 2006). At
each iteration, we randomly sample 10% of all the possible blocks for training
a weak classifier. We have used weighted linear discriminant analysis (WLDA) as
weak classifiers, same as in (Paisitkriangkrai et al. 2008). Zhu et al. used
linear support vector machines as weak classifiers (Zhu et al. 2006), which can
also be used as weak classifiers here.
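The block descriptor described above (2 × 2 cells, 9 orientation bins per cell, then ℓ1 normalization) can be sketched as follows. This is a sketch under assumptions: it normalizes each 36-dimensional block vector on its own, the small constant `eps` is added only to avoid division by zero, and the cell histograms are taken as given rather than computed from image gradients.

```python
def block_descriptor(cell_histograms, eps=1e-12):
    """Concatenate four 9-bin cell histograms into a 36-D vector and l1-normalize it."""
    if len(cell_histograms) != 4 or any(len(h) != 9 for h in cell_histograms):
        raise ValueError("expected 2 x 2 cells with 9 bins each")
    vector = [value for histogram in cell_histograms for value in histogram]
    norm = sum(abs(value) for value in vector) + eps
    return [value / norm for value in vector]


cells = [[1.0] * 9, [2.0] * 9, [0.0] * 9, [1.0] * 9]
descriptor = block_descriptor(cells)
print(len(descriptor), sum(descriptor))
```

After normalization the entries of a non-empty block sum to (approximately) one, which makes descriptors from blocks of different sizes comparable.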

In this experiment, all cascade classifiers have the same number of nodes and
weak classifiers. For the same reason described in the face detection section,
the FisherBoost/LACBoost and Wu et al.'s LDA/LAC post-processing are applied to
the cascade from the 3rd node onwards, instead of the first node. The positive
examples remain the same for all nodes while the negative examples in later
nodes are obtained by a bootstrap approach. The parameter θ of our FisherBoost
and LACBoost is selected from a predefined set of candidate values. We have not
carefully selected θ in this experiment. Ideally, cross-validation should be
used to pick the best value of θ by using an independent cross-validation data
set. Since there are not many labeled positive training data in the INRIA data
set, we use the same 2,416 positive examples for validation. We collect 500
additional negative examples by bootstrapping for validation. Further
improvement is expected if the positive data used during validation is
different from those used during training. During evaluation, we use a step
stride of 4 × 4 pixels with 10 scales per octave (a scale ratio of 1.0718). The
performance of different cascade detectors is evaluated using a protocol
described in (Dollár et al. 2012). A technique known as pairwise maximum
suppression (Dollár 2012) is applied to suppress less confident detection
windows. A confidence score is needed for each detection window as the input of
pairwise maximum suppression. In this work, this confidence is simply
calculated as the mean of decision scores of the last five nodes in the
cascade.

Fig. 5: FisherBoost (HOG-Fisher) and LACBoost (HOG-LACBoost) are compared with
other cascade pedestrian detectors on the INRIA data set. All cascades are
trained with the same number of weak classifiers and nodes, using HOG features.
In the legend, detectors are sorted based on their log-average detection rates.
FisherBoost performs best compared to other cascades. (Horizontal axis: false
positives per image.)
Legend:
  72.46% HOG-Fisher
  71.88% HOG-MultiExit-Asym-LDA
  70.94% HOG-MultiExit-LDA
  69.45% HOG-LACBoost
  5.22% HOG-MultiExit-Asym-LAC
  67.60% HOG-MultiExit-Asym
  65.35% HOG-MultiExit-LAC
  65.32% HOG-MultiExit-Ada
  62.24% HOG-VJ

The ROC curves are plotted in Fig. 5. As in (Dollár et al. 2012), the
log-average detection rate is used to summarize overall detection performance,
which is the mean of detection rates sampled evenly at 9 positions from 0.01 to
1 false positives per image. In general, FisherBoost (HOG-Fisher) outperforms
all other cascade detectors. Similar to our previous experiments, LAC and LDA
post-processing further improve the performance of AdaBoost. However, we
observe that both FisherBoost and LDA post-processing have a better
generalization performance than LACBoost and LAC post-processing. We will
discuss this issue at the end of the experiments.
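A possible way to compute such a summary from a measured curve is sketched below. It assumes, following the protocol of Dollár et al., that the 9 reference points are evenly spaced in log space between 0.01 and 1 false positives per image, and that the curve is interpolated linearly in log10(FPPI); both the interpolation scheme and the function names are assumptions made here.

```python
import math


def interpolate(xs, ys, x):
    """Piecewise-linear interpolation at x; xs must be increasing. Clamps outside the range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for k in range(len(xs) - 1):
        if xs[k] <= x <= xs[k + 1]:
            fraction = (x - xs[k]) / (xs[k + 1] - xs[k])
            return ys[k] + fraction * (ys[k + 1] - ys[k])


def log_average_detection_rate(fppi, detection_rates, lo=0.01, hi=1.0, num=9):
    """Mean detection rate at num reference FPPI values evenly spaced in log space."""
    log_fppi = [math.log10(v) for v in fppi]  # requires strictly positive FPPI values
    log_lo, log_hi = math.log10(lo), math.log10(hi)
    references = [log_lo + k * (log_hi - log_lo) / (num - 1) for k in range(num)]
    samples = [interpolate(log_fppi, detection_rates, r) for r in references]
    return sum(samples) / num


curve_fppi = [0.01, 0.1, 1.0]
curve_rate = [0.5, 0.7, 0.9]
print(log_average_detection_rate(curve_fppi, curve_rate))
```

Because the reference points are spread over two decades of false positives per image, the summary weights the low-false-positive region as heavily as the high one.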



5.5 Comparison with State-of-the-art Pedestrian 
Detectors 

In this experiment, we compare FisherBoost with state-of-the-art pedestrian
detectors on several public data sets. In (Dollár et al. 2012), the authors
compare various pedestrian detectors and conclude that combining multiple
discriminative features can often significantly boost the performance of
pedestrian detection. This is not surprising since a similar conclusion was
drawn in (Gehler and Nowozin 2009) on an object recognition task. Clearly, a
pedestrian detector that relies solely on the HOG feature is unlikely to
outperform those using a combination of features.

To this end, we train our pedestrian detector by combining both HOG features
(Dalal and Triggs 2005) and covariance features (Tuzel et al. 2008)^. For HOG,
we use the same experimental settings as our previous experiment. For
covariance features, we use the following image statistics:
$[x, y, I, |I_x|, |I_y|, \sqrt{I_x^2 + I_y^2}, |I_{xx}|, |I_{yy}|, \arctan(|I_x|/|I_y|)]$,
where x and y are the pixel location, I is the pixel intensity, I_x and I_y are
first order intensity derivatives, I_xx and I_yy are second order intensity
derivatives, and the last term is the edge orientation. Each pixel is mapped to
a 9-dimensional feature image. We then calculate 36 correlation coefficients in
each block and concatenate these features to previously computed HOG features.
The new feature not only encodes the gradient histogram (edges) but also
information on the correlation of the defined statistics inside each spatial
layout (texture). Similar to the previous experiment, we project these new
features to a line using weighted linear discriminant analysis. Except for the
new features, other training and test implementations are the same as those in
the previous pedestrian detection experiments.
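The count of 36 follows from taking the pairwise correlation coefficients of the 9 per-pixel statistics: 9 × 8 / 2 = 36 distinct pairs. The sketch below computes them for one block. It is illustrative only: the per-pixel feature vectors are taken as given, the function name is hypothetical, and a coefficient is set to zero when one of the two statistics is constant over the block.

```python
import math
from itertools import combinations


def correlation_features(pixel_features):
    """Pairwise correlation coefficients of d per-pixel statistics over one block.

    pixel_features: list of d-dimensional vectors, one per pixel in the block.
    Returns d * (d - 1) / 2 values, ordered by index pair (a, b) with a < b.
    """
    n, d = len(pixel_features), len(pixel_features[0])
    means = [sum(p[k] for p in pixel_features) / n for k in range(d)]
    cov = [[sum((p[a] - means[a]) * (p[b] - means[b]) for p in pixel_features) / (n - 1)
            for b in range(d)] for a in range(d)]
    features = []
    for a, b in combinations(range(d), 2):
        scale = math.sqrt(cov[a][a] * cov[b][b])
        features.append(cov[a][b] / scale if scale > 0 else 0.0)
    return features


# Deterministic toy block: 20 pixels, 9 statistics each.
block = [[float((i * (k + 1)) % 7) + 0.1 * k * i for k in range(9)] for i in range(20)]
coefficients = correlation_features(block)
print(len(coefficients))
```

In a detector these values would be computed for every candidate block, so an efficient implementation would rely on integral images rather than the direct sums used here.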

We first compare FisherBoost (HOGCOV-Fisher) with two baseline detectors
trained with AdaBoost. The first baseline detector is trained with the
conventional cascade (HOGCOV-VJ) while the second baseline detector is trained
with the multi-exit cascade (HOGCOV-MultiExit-Ada). All detectors are trained
with both HOG and covariance features on the INRIA training set. The results on
the INRIA test set using the protocol in (Dollár et al. 2012) are reported in
Fig. 6 (a). Similar to previous results, FisherBoost outperforms both baseline
detectors.

^ Covariance features capture the relationship between different image
statistics and have been shown to perform well in our previous experiments.
However, other discriminative features can also be used here instead, e.g.,
Haar-like features, Local Binary Pattern (LBP) (Mu et al. 2008) and
self-similarity of low-level features (CSS) (Walk et al. 2010).
Fig. 6: The performance of our pedestrian detector (HOGCOV-Fisher) compared with (a) baseline detectors and
(b, c, d) state-of-the-art detectors on publicly available pedestrian data sets. Our detector uses HOG and covariance
features. The performances are ranked using log-average detection rates in the legend. Our detector performs best
on the INRIA (Dalal and Triggs 2005) data set, second best on the TUD-Brussels (Wojek et al. 2009) and ETH
(Ess et al. 2007) data sets. Note that the best one on the latter two data sets has either used many more features
or used a more sophisticated part-based model.



Our detector is then compared with existing pedes- 
trian detectors listed in (Dollar et al. 2012), on the IN- 
RIA, TUD-Brussels and ETH data sets. For the TUD- 
Brussels and ETH data sets, since sizes of ground-truths 
are smaller than that in INRIA training set, we up- 
sample the original image to 1280 x 960 pixels before 
applying our pedestrian detector. ROC curves and log- 
average detection rates are reported in Fig. 6 (b), (c) 
and (d). On the ETH data set, FisherBoost outperforms 
all the other 14 compared detectors. On the TUD-Bru- 
ssels data set, our detector is the second best, only in- 
ferior to MultiFtr-l-Motion (Walk et al. 2010) that uses 



more discriminative features (gradient, self-similarity 
and motion) than ours. On the INRIA data set, Fisher- 
Boost's performance is also ranked the second, and only 
worse than the part-based detector (Felzenszwalb et al. 
2010) which uses a much more complex model (de- 
formable part models) and training process (latent S- 
VM). We believe that by further combining with more 
discriminative features, e.g., CSS features as used in 
(Walk et al. 2010), the overall detection performance of 
our method can be further improved. In summary, de- 
spite the use of simple HOG plus covariance features, 
our FisherBoost pedestrian detector still achieves the 
Detector                    avg. features   frames/sec.
FisherBoost + multi-exit        10.89          0.186
AdaBoost + multi-exit           11.35          0.166
AdaBoost + VJ cascade           21.00          0.109

Table 2: Average features required per detection window and average frames
processed per second for different pedestrian detectors on CalTech images of
640 × 480 pixels (based on our own implementation).



state-of-the-art performance on public benchmark data 
sets. 

Finally, we report the average number of features evaluated per scanning window
in Table 2. We compare FisherBoost with our implementation of AdaBoost with the
traditional cascade and AdaBoost with the multi-exit cascade. Each image is
scanned with a step stride of 4 × 4 pixels and 10 scales per octave, which
gives 90,650 patches to be classified per image. On a single core of a 2.8 GHz
Intel i7 CPU, our detector achieves an average speed of 0.186 frames per second
(on 640 × 480 pixel CalTech images), which is ranked eighth compared with the
15 detectors evaluated in (Dollar et al. 2012). Currently, 90% of the total
evaluation time is spent on extracting both HOG and covariance features (60% of
the evaluation time is spent on extracting raw HOG and covariance features,
while another 30% is spent on computing integral images for fast feature
calculation during the scanning phase).

The major bottleneck of our pedestrian detector lies in the feature extraction
part. In our implementation, we make use of multi-threading to speed up the
runtime of our pedestrian detector. Using all 8 cores of the Intel i7 CPU, we
are able to reduce the average processing time to less than 1 second per frame.
We believe that by using special-purpose hardware, such as a graphics
processing unit (GPU), the speed of our detector can be significantly improved.

5.5.1 Discussion 

Impact of varying the number of weak classifiers. In the next experiment, we
vary the number of weak classifiers in each cascade node to evaluate their
impact on the final detection performance. We train three different pedestrian
detectors (Fisher4/5/6, see Table 3 for details) on the INRIA data set. We
limit the maximum number of weak classifiers in each multi-exit node to 80. The
first two nodes are trained using AdaBoost and subsequent nodes are trained
using FisherBoost. Fig. 7 shows ROC curves of the different detectors. Although
we observe a performance improvement as the number of weak classifiers
increases, this improvement is minor compared to a significant increase in the
average number of features required per detection window. This experiment
indicates the robustness of FisherBoost to the number of weak classifiers in
the multi-exit cascade. Note that Fisher5 is used in our previous experiments
on pedestrian detection.

Impact of training FisherBoost from an early node. In the previous section, we
conjectured that FisherBoost performs well when the margin follows the Gaussian
distribution. As a result, we apply FisherBoost in the later nodes of a
multi-exit cascade (as these nodes often contain a large number of weak
classifiers). In this experiment, we show that it is possible to start training
FisherBoost from the first node of the cascade. To achieve this, one can train
an additional 50 weak classifiers in the first node (so that the margin
approximately follows the Gaussian distribution). We conduct an experiment by
training two FisherBoost detectors. In the first detector (Fisher50),
FisherBoost is applied from the first node onwards. The number of weak
classifiers in each node is 55, 60 (with 55 weak classifiers from the first
node), 70 (60 weak classifiers from previous nodes), 80 (70 weak classifiers
from previous nodes), etc. In the second detector (Fisher5), we apply AdaBoost
in the first two nodes and apply FisherBoost from the third node onwards. The
number of weak classifiers in each node is 5, 10 (with 5 weak classifiers from
the first node), 20 (10 from previous nodes), 30 (20 from previous nodes), etc.
Both detectors use the same node criterion, i.e., each node should discard at
least 50% of the background samples. All other configurations are kept the
same.
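
The node layout described above (each node reusing the weak classifiers of the previous nodes) can be sketched in a few lines. This is a minimal illustration, not the implementation used in the experiments: the function name `evaluate_multi_exit`, the toy weak classifiers, the weights and the thresholds are all hypothetical, and the sketch assumes a single shared weight per weak classifier so that the score can be accumulated from node to node.

```python
from typing import Callable, List, Sequence, Tuple

# A weak classifier maps a feature vector to +1 or -1.
WeakClassifier = Callable[[Sequence[float]], int]


def evaluate_multi_exit(
    x: Sequence[float],
    weak: List[WeakClassifier],
    weights: Sequence[float],
    exits: Sequence[int],
    thresholds: Sequence[float],
) -> Tuple[bool, int]:
    """Return (accepted, number of weak classifiers evaluated).

    exits[t] is the cumulative number of weak classifiers available at node t,
    e.g. 5, 10, 20, 30 for the first four nodes of a Fisher5-style cascade.
    """
    score = 0.0
    used = 0
    for exit_at, threshold in zip(exits, thresholds):
        # Only the weak classifiers that are new at this node are evaluated;
        # the responses from earlier nodes are already part of `score`.
        for j in range(used, exit_at):
            score += weights[j] * weak[j](x)
        used = exit_at
        if score < threshold:
            return False, used  # rejected at this exit
    return True, used


# Toy setup (hypothetical): weak classifier j is the sign of feature j.
weak = [lambda x, j=j: 1 if x[j] > 0 else -1 for j in range(4)]
weights = [1.0, 1.0, 1.0, 1.0]
exits = [1, 2, 4]
thresholds = [0.0, 1.0, 2.0]

print(evaluate_multi_exit([1, 1, 1, 1], weak, weights, exits, thresholds))
print(evaluate_multi_exit([-1, 1, 1, 1], weak, weights, exits, thresholds))
```

The second return value is the per-window cost; averaging it over all scanned windows gives the kind of "avg. features" figure reported in the tables.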

We report the performance of both detectors in Fig. 
7. From the results, Fisher50 performs slightly better 
than Fisher5 (log-average detection rate of 80.38% vs. 
79.61%). Based on these results, classifiers in early nodes 
of the cascade may be heuristically chosen such that a 
large number of easy negative patches can be quickly 
discarded. In other words, the first few nodes can sig- 
nificantly affect the efficiency of the visual detector but 
do not play a significant role in the final detection per- 
formance. Actually, one can always apply simple classi- 
fiers to remove a large percentage of negative windows 
to speed up the detection. 

5.6 Why LDA Works Better Than LAC 

Wu et al. observed that in many cases, LDA post-processing gives better
detection rates on the MIT+CMU face data than LAC (Wu et al. 2008). When using
the LDA criterion to select Haar features, Shen et al. (2011)
Fig. 7: Performance comparison. (a) We vary the number of weak classifiers in each multi-exit node. When more
weak classifiers are used in each node, the accuracy can be slightly improved. (b) We start training FisherBoost
from the first node (HOGCOV-Fisher50). HOGCOV-Fisher50 can achieve a slightly better detection rate than
HOGCOV-Fisher5.





Node       1   2   3   4   5   6   7   8   9  10  11  12 onwards   avg. features   log-average det. rate
Fisher4    4   4   8   8  16  16  32  32  64  64  80      80            26.4              79.52%
Fisher5    5   5  10  10  20  20  40  40  80  80  80      80            26.2              79.61%
Fisher6    6   6  12  12  24  24  48  48  80  80  80      80            30.6              79.88%

Table 3: We compare the performance of FisherBoost by varying the number of weak classifiers in each multi-exit
node. Average features required per detection window and log-average detection rates on the INRIA pedestrian
data set are reported. When more weak classifiers are used in each multi-exit node, slightly improved accuracy can
be achieved at the price of more features being evaluated.



tried different combinations of the two classes' covariance matrices for
calculating the within-class matrix: $C_w = \Sigma_1 + \delta \Sigma_2$, with
$\delta$ being a nonnegative constant. It is easy to see that $\delta = 1$ and
$\delta = 0$ correspond to LDA and LAC, respectively. They found that setting
$\delta \in [0.5, 1]$ gives the best results on the MIT+CMU face detection task
(Paisitkriangkrai et al. 2009; Shen et al. 2011).
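
To see how $\delta$ interpolates between the two criteria, the following sketch computes the projection direction $w \propto (\Sigma_1 + \delta \Sigma_2)^{-1} (\mu_1 - \mu_2)$ for a two-dimensional toy problem. The closed form is the standard maximizer of a ratio of this type and is assumed here rather than taken from the text; the function names and all numerical values are hypothetical.

```python
from math import hypot


def solve_2x2(a, b):
    """Solve the 2x2 linear system a @ w = b by Cramer's rule."""
    (a11, a12), (a21, a22) = a
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("matrix is numerically singular")
    return [(a22 * b[0] - a12 * b[1]) / det, (a11 * b[1] - a21 * b[0]) / det]


def projection_direction(mu1, mu2, sigma1, sigma2, delta):
    """Unit vector proportional to (sigma1 + delta * sigma2)^{-1} (mu1 - mu2)."""
    c_w = [[sigma1[i][j] + delta * sigma2[i][j] for j in range(2)] for i in range(2)]
    diff = [mu1[i] - mu2[i] for i in range(2)]
    w = solve_2x2(c_w, diff)
    norm = hypot(w[0], w[1])
    return [w[0] / norm, w[1] / norm]


# Hypothetical toy statistics: correlated positives, identity-like negatives.
mu1, mu2 = [2.0, 0.0], [0.0, 0.0]
sigma1 = [[1.0, 0.8], [0.8, 1.0]]
sigma2 = [[1.0, 0.0], [0.0, 1.0]]

w_lac = projection_direction(mu1, mu2, sigma1, sigma2, delta=0.0)  # LAC
w_lda = projection_direction(mu1, mu2, sigma1, sigma2, delta=1.0)  # LDA
print("LAC direction:", w_lac)
print("LDA direction:", w_lda)
```

In this toy case the LDA direction is pulled towards the mean difference, which is the shrinkage effect one expects from adding a multiple of the identity to $\Sigma_1$.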

According to the analysis in this work, LAC is optimal if the distribution of
$[h_1(x), h_2(x), \cdots, h_n(x)]$ on the negative data is symmetric. In
practice, this requirement may not be perfectly satisfied, especially for the
first several node classifiers. This may explain why in some cases the
improvement of LAC is not significant. However, this does not explain why LDA
(FisherBoost) works, and sometimes even performs better than LAC (LACBoost). At
first glance, LDA (or FisherBoost) by no means explicitly considers the
imbalanced node learning objective. Wu et al. did not have a plausible
explanation either (Wu et al. 2008; 2005).



Proposition 1 For object detection problems, the Fisher linear discriminant
analysis can be viewed as a regularized version of the linear asymmetric
classifier. In other words, linear discriminant analysis has already considered
the asymmetric learning objective. In FisherBoost, this regularization is
equivalent to having an $\ell_2$-norm penalty on the primal variable $w$ in the
objective function of the QP problem in Section 4. Having the $\ell_2$-norm
regularization, $\|w\|_2^2$, avoids over-fitting and increases the robustness
of FisherBoost. A similar penalty is also used in machine learning algorithms
such as ridge regression (also known as Tikhonov regularization).



              $\delta = 0$ (LACBoost)   $\delta = 0.1$   $\delta = 0.2$   $\delta = 0.5$   $\delta = 1$ (FisherBoost)
Digits        99.12 (0.1)               99.57 (0.2)      99.57 (0.1)      99.55 (0.1)      99.40 (0.1)
Faces         98.63 (0.3)               98.82 (0.3)      98.84 (0.2)      98.48 (0.4)      98.89 (0.2)
Cars          96.80 (1.5)               97.47 (1.1)      97.69 (1.2)      97.96 (1.2)      97.78 (1.2)
Pedestrians   99.12 (0.4)               99.31 (0.1)      99.22 (0.1)      99.13 (0.3)      98.73 (0.3)
Scenes        97.50 (1.1)               98.30 (0.6)      98.62 (0.7)      99.16 (0.4)      99.66 (0.1)
Average (%)   98.23                     98.69            98.79            98.86            98.89

Table 4: The average detection rate and its standard deviation (in %) at 50% false positives. We vary the value of
$\delta$, which balances the ratio between the positive and negative classes' covariance matrices.

Fig. 8: The covariance matrix of the first 112 weak classifiers selected by
FisherBoost on non-pedestrian data. It may be approximated by a scaled identity
matrix. On average, the magnitude of the diagonal elements is 20 times larger
than that of the off-diagonal elements.

For object detection such as the face and pedestrian detection considered here,
the covariance matrix of the negative class is close to a scaled identity
matrix. In theory, the negative data can be anything other than the target. Let
us look at one of the off-diagonal elements

$$
\Sigma_{ij,\, i \neq j} = E\big[ (h_i(x) - E[h_i(x)]) (h_j(x) - E[h_j(x)]) \big]
= E[h_i(x) h_j(x)] \approx 0. \tag{24}
$$

Here $x$ is the image feature of the negative class. We can assume that $x$ is
i.i.d. and that, approximately, $x$ follows a symmetric distribution, so
$E[h_{i,j}(x)] = 0$. That is to say, on the negative class, the chance of
$h_{i,j}(x) = +1$ or $h_{i,j}(x) = -1$ is the same, which is 50%. Note that
this does not apply to the positive class because $x$ of the positive class is
not symmetrically distributed, in general. The last equality of (24) uses the
fact that the weak classifiers $h_i(\cdot)$ and $h_j(\cdot)$ are approximately
statistically independent. Although this assumption may not hold in practice,
as pointed out in (Shen and Li 2010b), it could be a plausible approximation.

Therefore, the off-diagonal elements of $\Sigma$ are almost all zeros, and
$\Sigma$ is a diagonal matrix. Moreover, in object detection it is a reasonable
assumption that the diagonal elements $E[h_j(x) h_j(x)]$ ($j = 1, 2, \cdots$)
have similar values. Hence, $\Sigma_2 \approx \nu I$ holds, with $\nu$ being a
small positive constant.
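
The regularization view can be made explicit with a short calculation. Assuming a within-class matrix of the form $\Sigma_1 + \delta \Sigma_2$ as discussed earlier in this section, and substituting the approximation $\Sigma_2 \approx \nu I$, the quadratic term used by LDA ($\delta = 1$) expands as follows. This is a sketch under these approximations, not an exact identity:

```latex
\begin{align*}
w^\top (\Sigma_1 + \Sigma_2)\, w
  &\approx w^\top (\Sigma_1 + \nu I)\, w \\
  &= \underbrace{w^\top \Sigma_1 w}_{\text{LAC term}}
   \;+\; \underbrace{\nu \, \|w\|_2^2}_{\ell_2 \text{ penalty}} .
\end{align*}
```

The first term is the quadratic form that LAC uses on its own; the second is an $\ell_2$ penalty on $w$ whose strength is set by $\nu$.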

So for object detection, the only difference between LAC and LDA is that, for
LAC, $C_w = \Sigma_1$, and for LDA, $C_w = \Sigma_1 + \nu I$.
In summary, LDA-like approaches (e.g., LDA post-processing and FisherBoost)
perform better than LAC-like approaches (e.g., LAC and LACBoost) in object
detection for two main reasons. The first reason is that LDA is a regularized
version of LAC. The second reason is that the negative data are not necessarily
symmetrically distributed. In particular, in later nodes, bootstrapping forces
the negative data to be visually similar to the positive data. In this case,
ignoring the negative data's covariance information is likely to deteriorate
the detection performance.

Fig. 8 shows some empirical evidence that $\Sigma_2$ is close to a scaled
identity matrix. As we can see, the diagonal elements are much larger than the
off-diagonal elements (the off-diagonal ones are close to zero).
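
A check of this kind is easy to reproduce on any matrix of weak-classifier responses. The sketch below is illustrative only: it replaces real detector outputs with simulated independent $\pm 1$ responses (the idealized situation assumed in the argument above), and the helper names `covariance_matrix` and `diag_to_offdiag_ratio` are hypothetical.

```python
import random


def covariance_matrix(samples):
    """Sample covariance of the columns of `samples` (a list of equal-length rows)."""
    m, n = len(samples), len(samples[0])
    mean = [sum(row[j] for row in samples) / m for j in range(n)]
    return [
        [
            sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in samples) / (m - 1)
            for j in range(n)
        ]
        for i in range(n)
    ]


def diag_to_offdiag_ratio(cov):
    """Mean |diagonal entry| divided by mean |off-diagonal entry|."""
    n = len(cov)
    diag = sum(abs(cov[i][i]) for i in range(n)) / n
    off = sum(abs(cov[i][j]) for i in range(n) for j in range(n) if i != j)
    return diag / (off / (n * (n - 1)))


# Simulated responses of 8 independent weak classifiers on 2000 negative windows.
rng = random.Random(0)
outputs = [[1 if rng.random() < 0.5 else -1 for _ in range(8)] for _ in range(2000)]

ratio = diag_to_offdiag_ratio(covariance_matrix(outputs))
print(f"diagonal / off-diagonal magnitude: {ratio:.1f}")
```

On real data the weak classifiers are not exactly independent, so the ratio will be smaller than in this idealized simulation.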

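In this experiment, we evaluate the impact of the regularization parameter by
varying the value of $\delta$, which balances the ratio between the positive
and negative classes' covariance matrices, i.e.,
$C_w = \Sigma_1 + \delta \Sigma_2$, and also

$$
Q = \begin{bmatrix} Q_1 & 0 \\ 0 & \delta Q_2 \end{bmatrix}.
$$

Setting $\delta = 0$ corresponds to LACBoost,
$Q = \begin{bmatrix} Q_1 & 0 \\ 0 & 0 \end{bmatrix}$, while setting
$\delta = 1$ corresponds to FisherBoost,
$Q = \begin{bmatrix} Q_1 & 0 \\ 0 & Q_2 \end{bmatrix}$.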

We conduct our experiments on 5 visual data sets by setting the value of
$\delta$ to be $\{0, 0.1, 0.2, 0.5, 1\}$. All 5 classifiers are trained to
remove 50% of the negative data, while retaining almost all positive data. We
compare their detection rates in Table 4. First, in general, we observe a
performance improvement when we set $\delta$ to be a small positive value.
Since setting $\delta$ to be 1 happens to coincide with the LDA objective
criterion, the LDA classifier also inherits the node learning goal of LAC in
the context of object detection. Second, on different data sets, in theory this
parameter should be cross-validated, and setting it to be 1 (FisherBoost) does
not always give the best performance, which is not surprising.

At this point, a hypothesis naturally arises: if regularization is really the
reason why LACBoost underperforms FisherBoost, then applying other forms of
regularization to LACBoost would also be likely to improve LACBoost. Our last
experiment tries to verify this hypothesis.

Here we regularize the matrix $Q$ by adding an appropriately scaled identity
matrix: $Q + \delta I$. As discussed in Section 4.1, from a numerical stability
point of view, difficulties arise when $Q$ is rank-deficient, causing the dual
solution to (18) to be non-uniquely defined. This issue is much worse for
LACBoost because the lower block of $Q$, i.e., $Q_2$, is a zero matrix. In that
case, a well-defined problem can be obtained by replacing $Q$ with
$Q + \delta I$. This can be interpreted as corresponding to the
primal-regularized QP (refer to (16)):

$$
\begin{aligned}
\min_{w, \rho} \;\; & \tfrac{1}{2} \rho^\top Q \rho - \theta e^\top \rho + \delta \|\rho\|_2^2, \\
\text{s.t.} \;\; & w \succcurlyeq 0, \; 1^\top w = 1, \\
& \rho_i = (A w)_i, \; i = 1, \cdots, m.
\end{aligned}
\tag{25}
$$

Clearly here in the primal, we are applying the Tikhonov $\ell_2$-norm
regularization to the variable $\rho$. Also, we expect an accuracy improvement
with this regularization because the margin variance is minimized by minimizing
the $\ell_2$ norm of the margin while maximizing the weighted mean of the
margin, i.e., $e^\top \rho$. Thus a better margin distribution may be achieved
(Shen and Li 2010a;b).
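
The effect of the ridge term on a rank-deficient $Q$ can be illustrated numerically. In the sketch below, the 4 × 4 matrix, the margin vector and the helper names are hypothetical: the upper block plays the role of $Q_1$ (taken to be an identity purely for illustration) and the lower block is zero, as for LACBoost. Along a direction that only involves the lower block, the unregularized quadratic form is flat, while $Q + \delta I$ gives it positive curvature (up to how the constant factor is absorbed into $\delta$).

```python
def quad_form(q, rho):
    """Evaluate rho^T q rho for a square matrix q given as a list of rows."""
    n = len(rho)
    return sum(rho[i] * q[i][j] * rho[j] for i in range(n) for j in range(n))


def add_ridge(q, delta):
    """Return q + delta * I without modifying q."""
    n = len(q)
    return [[q[i][j] + (delta if i == j else 0.0) for j in range(n)] for i in range(n)]


# LACBoost-style Q (hypothetical values): upper block non-zero, lower block zero.
q_lac = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]

# A margin perturbation that only touches the block where Q is zero.
rho = [0.0, 0.0, 1.0, -1.0]

flat = quad_form(q_lac, rho)
curved = quad_form(add_ridge(q_lac, 0.1), rho)
print("without ridge:", flat)
print("with ridge:   ", curved)
```

Because the regularized form is strictly positive for every non-zero `rho`, the regularized matrix is invertible and the dual solution becomes unique.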

Now we evaluate the impact of the regularization parameter $\delta$ by running
experiments on the same data sets as in the last experiment. We vary the value
of $\delta$, and the resulting detection accuracy is reported in Table 5.
Again, the 5 classifiers are trained to remove 50% of the negative data, while
correctly classifying as many positive data as possible. As can be seen,
regularization indeed often improves the results. Note that in the experiments,
we have solved the primal optimization problem so that even when $Q$ is not
invertible, we can still obtain a solution. Having the primal solutions, the
dual solutions are obtained using (23). This experiment demonstrates that other
forms of regularization indeed improve LACBoost too.

6 Conclusion 

By explicitly taking into account the node learning goal 
in cascade classifiers, we have designed new boosting 
algorithms for more effective object detection. 

Experiments validate the superiority of the methods developed, which we have
labeled FisherBoost and LACBoost. We have also proposed the use of entropic
gradient descent to efficiently implement FisherBoost and LACBoost. The
proposed algorithms are easy to implement and can be applied to other
asymmetric classification tasks in computer vision. In future work, we aim to
design new asymmetric boosting algorithms by exploiting asymmetric kernel
classification methods such as (Tu and Lin 2010). Compared with stage-wise
AdaBoost, which is parameter-free, our boosting algorithms need to tune a
parameter.

We are also interested in developing parameter-free stage-wise boosting that
considers the node learning objective. Moreover, the developed boosting
algorithms only work for the case $\gamma_0 \leq 0.5$ in (2). How can we make
them work for $\gamma_0 > 0.5$? Last, relaxing the symmetric distribution
requirement for the feature responses of the negative class is also a topic of
interest.
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(5 = (LACBoost) 


5 = 5 X lO"'' 


<5 = 2 X lO"'' 


s = w-'^ 


5 = 5 X 10-"^ 


<5 = 2 X 10"^ 


S = 10"^ 


Digits 


99.12 (0.1) 


99.50 (0.1) 


99.41 (0.2) 


99.59 (0.2) 


99.60 (0.2) 


99.50 (0.3) 


99.11 (0.5) 


Faces 


98.63 (0.3) 


98.73 (0.0) 


98.87 (0.0) 


99.02 (0.0) 


98.38 (0.0) 


98.84 (0.0) 


99.04 (0.0) 


Cars 


96.80 (1.5) 


96.62 (1.5) 


96.80 (1.5) 


96.80 (1.5) 


96.67 (1.4) 


96.58 (1.5) 


96.71 (1.5) 


Pedestrians 
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Scenes 
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Average (%) 
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98.50 
Table 5: The average detection rate and its standard deviation (in %) at 50% false positives of various regularized
LACBoosts. We vary the value of $\delta$ in $Q + \delta I$. Regularization often improves the overall detection accuracy.

A Proof of Theorem 1

Before we present our results, we introduce an impor- 
tant proposition from (Yu et al. 2009). Note that we 
have used different notation. 

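Proposition 2 For a few different distribution families, the worst-case
constraint

$$
\inf_{x \sim (\mu, \Sigma)} \Pr\{ w^\top x \leq b \} \geq \gamma, \tag{26}
$$

can be written as:

1. if $x \sim (\mu, \Sigma)$, i.e., $x$ follows an arbitrary distribution with
   mean $\mu$ and covariance $\Sigma$, then

$$
b \geq w^\top \mu + \sqrt{\frac{\gamma}{1 - \gamma}} \cdot \sqrt{w^\top \Sigma w}; \tag{27}
$$

2. if $x \sim (\mu, \Sigma)_S$, then we have

$$
\begin{cases}
b \geq w^\top \mu + \sqrt{\dfrac{1}{2(1 - \gamma)}} \cdot \sqrt{w^\top \Sigma w}, & \text{if } \gamma \in (0.5, 1); \\
b \geq w^\top \mu, & \text{if } \gamma \in (0, 0.5];
\end{cases}
\tag{28}
$$

3. if $x \sim (\mu, \Sigma)_{SU}$, then

$$
\begin{cases}
b \geq w^\top \mu + \dfrac{2}{3} \sqrt{\dfrac{1}{2(1 - \gamma)}} \cdot \sqrt{w^\top \Sigma w}, & \text{if } \gamma \in (0.5, 1); \\
b \geq w^\top \mu, & \text{if } \gamma \in (0, 0.5];
\end{cases}
\tag{29}
$$

4. if $x$ follows a Gaussian distribution with mean $\mu$ and covariance
   $\Sigma$, i.e., $x \sim \mathcal{G}(\mu, \Sigma)$, then

$$
b \geq w^\top \mu + \Phi^{-1}(\gamma) \cdot \sqrt{w^\top \Sigma w}, \tag{30}
$$

   where $\Phi(\cdot)$ is the cumulative distribution function (c.d.f.) of the
   standard normal distribution $\mathcal{G}(0, 1)$, and $\Phi^{-1}(\cdot)$ is
   the inverse function of $\Phi(\cdot)$.

Two useful observations about $\Phi^{-1}(\cdot)$ are: $\Phi^{-1}(0.5) = 0$; and
$\Phi^{-1}(\cdot)$ is a monotonically increasing function in its domain.

Here $(\mu, \Sigma)_S$ denotes the family of distributions in $(\mu, \Sigma)$
that are also symmetric about the mean $\mu$. $(\mu, \Sigma)_{SU}$ denotes the
family of distributions in $(\mu, \Sigma)$ that are additionally symmetric and
linear unimodal about $\mu$.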






We omit the proof of Proposition 2 here and refer the 
reader to (Yu et al. 2009) for details. Next we begin to 
prove Theorem 1: 

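Proof The second constraint of (2) is simply

$$
b \geq w^\top \mu_2. \tag{31}
$$

The first constraint of (2) can be handled by writing $w^\top x_1 \geq b$ as
$-w^\top x_1 \leq -b$ and applying the results in Proposition 2. It can be
written as

$$
-b + w^\top \mu_1 \geq \varphi(\gamma) \sqrt{w^\top \Sigma_1 w}, \tag{32}
$$

with $\varphi(\cdot)$ given in (6).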

Let us assume that $\Sigma_1$ is strictly positive definite (if it is only
positive semidefinite, we can always add a small regularization to its diagonal
components). From (32) we have

$$
\varphi(\gamma) \leq \frac{-b + w^\top \mu_1}{\sqrt{w^\top \Sigma_1 w}}. \tag{33}
$$

So the optimization problem becomes

$$
\max_{w, b, \gamma} \; \gamma, \quad \text{s.t. (31) and (33).} \tag{34}
$$



The maximum value of $\gamma$ (which we label $\gamma^\star$) is achieved when
(33) is strictly an equality. To illustrate this point, let us assume that the
maximum is achieved when

$$
\varphi(\gamma^\star) < \frac{-b + w^\top \mu_1}{\sqrt{w^\top \Sigma_1 w}}.
$$

Then a new solution can be obtained by increasing $\gamma^\star$ by a positive
value such that (33) becomes an equality. Notice that the constraint (31) will
not be affected, and the new solution will be better than the previous one.
Hence, at the optimum, (5) must be fulfilled.

Because $\varphi(\gamma)$ is monotonically increasing for all the four cases in
its domain $(0, 1)$ (see Fig. 9), maximizing $\gamma$ is equivalent to
maximizing $\varphi(\gamma)$, and this results in

$$
\max_{w, b} \; \frac{-b + w^\top \mu_1}{\sqrt{w^\top \Sigma_1 w}}, \quad \text{s.t. } b \geq w^\top \mu_2. \tag{35}
$$

As in (Lanckriet et al. 2002; Huang et al. 2004), we also have a scale
ambiguity: if $(w^\star, b^\star)$ is a solution, $(t w^\star, t b^\star)$ with
$t > 0$ is also a solution.

An important observation is that the problem (35) must attain the optimum at
(4). Otherwise, if $b > w^\top \mu_2$, the optimal value of (35) must be
smaller. So we can rewrite (35) as an unconstrained problem (3).

We have thus shown that, if $x_1$ is distributed according to a symmetric,
symmetric unimodal, or Gaussian distribution, the resulting optimization
problem is identical. This is not surprising considering the latter two cases
are merely special cases of the symmetric distribution family.

Fig. 9: The function $\varphi(\cdot)$ in (6), plotted against $\gamma$. The
four curves correspond to the four cases. They are all monotonically increasing
in $(0, 1)$.

At optimality, the inequality (33) becomes an equality, and hence
$\gamma^\star$ can be obtained as in (5). For ease of exposition, let us denote
the four cases on the right side of (6) as $\varphi_{\rm gnrl}(\cdot)$,
$\varphi_{S}(\cdot)$, $\varphi_{SU}(\cdot)$, and $\varphi_{G}(\cdot)$. For
$\gamma \in [0.5, 1)$, as shown in Fig. 9, we have
$\varphi_{\rm gnrl}(\gamma) > \varphi_{S}(\gamma) > \varphi_{SU}(\gamma) > \varphi_{G}(\gamma)$.
Therefore, when solving (5) for $\gamma^\star$, we have
$\gamma^\star_{\rm gnrl} < \gamma^\star_{S} < \gamma^\star_{SU} < \gamma^\star_{G}$.
That is to say, one can get better accuracy when additional information about
the data distribution is available, although the actual optimization problem to
be solved is identical.
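
The ordering of the four cases can be checked numerically. The sketch below assumes the four multipliers that appear in Proposition 2 (general, symmetric, symmetric unimodal, Gaussian) as the four cases of $\varphi$, and inverts each of them at a fixed value of the ratio on the right-hand side of (33). All function names are hypothetical, and the closed-form inverses are valid only when the resulting $\gamma^\star$ exceeds 0.5 (here, for a ratio greater than 1).

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def phi_general(g):
    return sqrt(g / (1.0 - g))


def phi_symmetric(g):
    return sqrt(1.0 / (2.0 * (1.0 - g))) if g > 0.5 else 0.0


def phi_symmetric_unimodal(g):
    return (2.0 / 3.0) * phi_symmetric(g)


def phi_gaussian(g):
    return _STD_NORMAL.inv_cdf(g)


def gamma_star(ratio):
    """Solve phi(gamma) = ratio for each family (assumes ratio > 1)."""
    return {
        "general": ratio**2 / (1.0 + ratio**2),
        "symmetric": 1.0 - 1.0 / (2.0 * ratio**2),
        "symmetric_unimodal": 1.0 - 2.0 / (9.0 * ratio**2),
        "gaussian": _STD_NORMAL.cdf(ratio),
    }


# Ordering of the four curves at one value of gamma in (0.5, 1).
g = 0.9
print([f(g) for f in (phi_general, phi_symmetric, phi_symmetric_unimodal, phi_gaussian)])

# Guaranteed detection rate for the same optimal ratio under each assumption.
for family, value in gamma_star(2.0).items():
    print(f"{family:>20s}: {value:.4f}")
```

The same projection therefore comes with a stronger guarantee on the detection rate as more is assumed about the distribution of the positive class.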



B Proof of Theorem 2 

Let us assume that in the current solution we have selected $n$ weak
classifiers and their corresponding linear weights are
$w = [w_1, \cdots, w_n]$. If we add a weak classifier $h'(\cdot)$ that is not
in the current subset and the corresponding $w$ is zero, then we can conclude
that the current weak classifiers and $w$ are already the optimal solution. In
this case, the best weak classifier that is found by solving the subproblem
(20) does not contribute to solving the master problem.

Let us consider the case that the optimality condition is violated. We need to
show that we are able to find a weak learner $h'(\cdot)$, which is not in the
set of currently selected weak classifiers, such that its corresponding
coefficient $w > 0$ holds. Again assume $h'(\cdot)$ is the most violated weak
learner found by solving (20) and the convergence condition is not satisfied.
In other words, we have

$$
\sum_{i=1}^{m} u_i y_i h'(x_i) > r. \tag{36}
$$

Now, after this weak learner is added into the master problem, the
corresponding primal solution $w$ must be non-zero (positive because we have
the nonnegativeness constraint on $w$).

If this is not the case, then the corresponding $w = 0$. This is not possible
for the following reason. From the Lagrangian (17), at optimality we have
$\partial L / \partial w = 0$, which leads to

$$
r - \sum_{i=1}^{m} u_i y_i h'(x_i) = q \geq 0. \tag{37}
$$

Clearly (36) and (37) contradict.

Thus, after the weak classifier $h'(\cdot)$ is added to the primal problem, its
corresponding $w$ must have a positive solution. This is to say, one more free
variable is added into the problem and re-solving the primal problem (16) must
reduce the objective value. Therefore a strict decrease in the objective is
obtained. In other words, Algorithm 1 must make progress at each iteration.
Furthermore, since the primal optimization problem is convex, there are no
local optimal points other than the global optimum. The column generation
procedure is guaranteed to converge to the global optimum up to some prescribed
accuracy.
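
The test that drives this argument, i.e., finding the weak classifier with the largest value of $\sum_i u_i y_i h(x_i)$ and comparing it with $r$, can be sketched as follows. This is only an illustration of the selection and stopping rule: the function name, the tolerance `tol` and the toy numbers are hypothetical, and the candidate pool is represented simply by precomputed $\pm 1$ responses on the training samples.

```python
def most_violated_weak_classifier(u, y, outputs, r, tol=1e-6):
    """Pick the candidate maximizing sum_i u_i * y_i * h(x_i).

    Returns (index, score) if the score exceeds r by more than `tol`,
    otherwise (None, score), which signals that no candidate violates the
    dual constraint and column generation can stop.
    """
    best_index, best_score = None, float("-inf")
    for j, h in enumerate(outputs):
        score = sum(ui * yi * hi for ui, yi, hi in zip(u, y, h))
        if score > best_score:
            best_index, best_score = j, score
    if best_score > r + tol:
        return best_index, best_score
    return None, best_score


# Toy problem (hypothetical numbers): 3 samples, 3 candidate weak classifiers.
u = [0.5, 0.3, 0.2]          # dual variables, one per training sample
y = [1, -1, 1]               # labels
outputs = [                  # outputs[j][i] = h_j(x_i)
    [1, 1, 1],
    [1, -1, 1],
    [-1, -1, -1],
]

print(most_violated_weak_classifier(u, y, outputs, r=0.6))
print(most_violated_weak_classifier(u, y, outputs, r=1.0))
```

In the first call the second candidate violates the constraint and would be added to the master problem; in the second call no candidate does, which corresponds to the convergence condition.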



C Exponentiated Gradient Descent

Exponentiated Gradient Descent (EG) is a very useful tool for solving
large-scale convex minimization problems over the unit simplex. Let us first
define the unit simplex
$\Delta_n = \{ w \in \mathbb{R}^n : 1^\top w = 1, w \succcurlyeq 0 \}$. EG
efficiently solves the convex optimization problem

$$
\min_{w} \; f(w), \quad \text{s.t. } w \in \Delta_n, \tag{38}
$$

under the assumption that the objective function $f(\cdot)$ is a convex
Lipschitz continuous function with Lipschitz constant $L_f$ w.r.t. a fixed
given norm $\|\cdot\|$. The mathematical definition of $L_f$ is that
$|f(x) - f(z)| \leq L_f \|x - z\|$ holds for any $x, z$ in the domain of
$f(\cdot)$. The EG algorithm is very simple:

1. Initialize with $w^0 \in$ the interior of $\Delta_n$;

2. Generate the sequence $\{w^k\}$, $k = 1, 2, \cdots$ with:

$$
w_j^k = \frac{ w_j^{k-1} \exp[ -\tau_k f'_j(w^{k-1}) ] }{ \sum_{j=1}^{n} w_j^{k-1} \exp[ -\tau_k f'_j(w^{k-1}) ] }.
$$

   Here $\tau_k$ is the step-size, and
   $f'(w) = [f'_1(w), \ldots, f'_n(w)]^\top$ is the gradient of $f(\cdot)$;

3. Stop if some stopping criteria are met.

The learning step-size can be determined by

$$
\tau_k = \frac{\sqrt{2 \log n}}{L_f} \frac{1}{\sqrt{k}},
$$

following (Beck and Teboulle 2003). In (Collins et al. 2008), the authors have
used a simpler strategy to set the learning rate.

In EG there is an important parameter $L_f$, which is used to determine the
step-size. $L_f$ can be determined by the $\ell_\infty$-norm of $|f'(w)|$. In
our case $f'(w)$ is a linear function, which is trivial to compute. The
convergence of EG is guaranteed; see (Beck and Teboulle 2003) for details.
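
The EG iteration can be sketched in a few lines. This is a minimal illustration rather than the implementation used for FisherBoost or LACBoost: the toy objective $f(w) = \tfrac{1}{2}\|w - t\|_2^2$ with a target $t$ inside the simplex, the bound $L_f = 1$ and the function name are assumptions made for the example, and the step-size rule $\tau_k = \sqrt{2 \log n} / (L_f \sqrt{k})$ is assumed to be the one of Beck and Teboulle (2003).

```python
from math import exp, log, sqrt


def exponentiated_gradient(grad, n, lipschitz, iterations):
    """Minimize a convex function over the unit simplex with EG updates.

    grad(w) must return the gradient as a list of length n (n >= 2).
    """
    w = [1.0 / n] * n  # uniform start, which lies in the interior of the simplex
    for k in range(1, iterations + 1):
        tau = sqrt(2.0 * log(n)) / (lipschitz * sqrt(k))  # diminishing step-size
        g = grad(w)
        # Multiplicative update followed by renormalization onto the simplex.
        unnormalized = [w[j] * exp(-tau * g[j]) for j in range(n)]
        total = sum(unnormalized)
        w = [v / total for v in unnormalized]
    return w


# Toy objective (hypothetical): f(w) = 0.5 * ||w - target||^2, minimized at target.
target = [0.7, 0.2, 0.1]


def grad(w):
    return [w[j] - target[j] for j in range(len(w))]


# On the simplex each gradient entry is bounded by 1 in magnitude, so 1.0 is
# used as the bound on the l_inf norm of the gradient.
w = exponentiated_gradient(grad, n=3, lipschitz=1.0, iterations=2000)
print([round(v, 4) for v in w])
```

Because the update is multiplicative and renormalized, every iterate stays strictly positive and sums to one, so no explicit projection onto the simplex is needed.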



